TENSIONS AND SCHOOL ACHIEVEMENT EXAMINATIONS


INTRODUCTION 


Experientially there is reason to believe
that there are significant differences in tensions
felt before and during an examination.
Experimentally there has been little objective
identification of those individuals who have
high or low tensions directed toward, or associated
with, the classroom examination. It is
true that there have been studies which show
that examination tensions do exist, but few
of these investigations have attempted to
ascertain the degree of tension, or to designate
the individuals studied as belonging to a
“high tension” group or a “low tension”
group.

The use of the word “tension” seems more
appropriate to describe the affective state of
an individual in certain situations (for example,
taking or about to take an examination,
applying for a new position, about to
go upon the lecture platform) than does the
word “emotion,” which is commonly reserved
for those situations bearing a stronger affect.
It is true that the word “tension,” as used
in psychology, may have several slightly different
meanings. Warren’s Dictionary of
Psychology lists three: “1) a feeling of strain
or suspense; 2) the condition of a muscle
when it is acting against considerable resistance;
3) a state of inequilibrium, leading to
change in behavior which tends to restore
equilibrium.”¹ The first and third of these
connotations receive emphasis in this study.
Most of those who have studied examination
tensions have been interested mainly in
the affective state itself: its physiological (or
psychophysical) components, its etiology,
symptomology, or cure. Few have shown
much interest in the results of the examination
around which the tensions exist. The examination
has been a means of procuring the
affective states; it has seldom been a center
of interest.
¹ Dictionary of Psychology. Edited by Howard C. Warren.
J. THOMAS HASTINGS
University of Illinois 


The problem of the present investigation 
has two main parts: (1) to develop a tech- 
nique by which the pupils taking an examina- 
tion can be differentiated in terms of intensity 
of tension directed toward the examination, 
and (2) to search for relationships between 
such tension scores and the examination 
results. 


It should be noted on the negative side 
that the purpose is not to investigate the 
causes nor etiology of examination anxieties 
and tensions, nor to study the effect on the 
individual of taking examinations. This is not 
because these problems are considered less
important or less real; it is because the in- 
terest, in this case, is centered in the exam- 
ination. 

It may be noted that there have been few 
studies of tension and school examinations.²
From these few studies we find: (1) ample 
evidence that students do exhibit tensions in 
connection with examination situations, but 
tensions appear to be absent in the period fol- 
lowing the examination; (2) that the more 
difficult examinations do elicit the greater 
tensions in general; (3) the interest in exam- 
ination tensions has been in the symptomol- 
ogy or structure of the affective state, rather 
than in the relationships between tensions and 
examination results, although there were two 
investigations in which tension scores were 
correlated with examination scores; (4) the 
methods which have been used to study ten- 
sions were (a) the physiological components 
of affect (i.e., blood pressure, heart and
respiration rates, electrodermal response, 
blood count, quantity of blood sugar, and 
presence or absence of glycosuria), (b) dis- 
turbances in the speech and motor reactions, 
and (c) the report of individuals concerning 
their tension feelings (a questionnaire). 
Items (1) and (2) in the foregoing enumeration
are evident with any one of these methods.
Also, we find that two types of relationship
between tensions and examination results are
suggested in previous work: (1) concomitant
variation is suggested by the correlation techniques
used by Brown and Waite; (2) a less
standard (or reliable or predictable) examination
product for higher tension groups is
suggested by Luria’s statements.

² Although a review of the literature is omitted in this
paper, the reader may wish to study items 4, 9, 13, 14, 15 in
the bibliography. The literature is reviewed in the doctoral
dissertation of which this paper is the essential portion.


OBJECTIVES OF THE INVESTIGATION 


The foregoing summary of the literature in 
the area of tensions and examinations clears 
the way for a more detailed statement of the 
problems with which the present investigation 
deals. It is the purpose of this section to 
present the specific objectives of this investi- 
gation. 


The primary statement of the problem has 
two parts: (1) to develop a technique by 
which the pupils taking an examination can 
be differentiated in terms of intensity of ten- 
sion directed toward the examination, (2) to 
search for relationships which such tension 
scores may bear to the examination results. 


Since the focal point of the study is the 
regular classroom achievement examination, 
the technique mentioned under item (1) must 
be applicable to such situations. Obviously, 
continuous records of physiological indices 
throughout the examination would be impos- 
sible for a regular examination. For purposes 
of classroom use, any laboratory method of 
measuring tension is, at best, awkward and 
laborious. This is certainly no reason, of 
itself, to abandon the laboratory methods, but 
it seems sound to use less awkward or burdensome
methods, if they can be shown to
correspond to the results of the more com- 
plicated procedures. A questionnaire method 
would solve this problem of practicability, if 
a sufficient degree of reliability and validity 
could be demonstrated. 


The first three objectives of this investiga- 
tion concern the development of such a tech- 
nique; i.e., it was sought (1) to develop a 
questionnaire which had as its purpose the 
differentiation of pupils in terms of tensions 
toward an examination, (2) to demonstrate 
a reasonable reliability for this instrument, 
(3) to demonstrate the validity of this instru- 
ment. In regard to validity, it is proposed to 
show: 
a. That certain fundamental characteristics
of validity are present in the questionnaire
results.

b. That differences in tension, as measured
by the questionnaire, between examination
situations correspond to differences
in objective characteristics of these
situations.

c. That the results of the questionnaire
correspond to the results of a technique
which utilizes disturbances in speech and
motor reactions as an index of tension.

d. That the results of the questionnaire
correspond to the results of an accepted
physiological index of affective states.


In connection with the second part of the 
statement of the problem, three more objec- 
tives may be set forth: (4) to demonstrate the 
existence or lack of existence of concomitant 
variation between tension scores and exami- 
nation results (this may be accomplished by 
computing correlation coefficients between 
the two variables); (5) to compare the reli- 
ability of scores on achievement examinations 
for “low tension groups” with the reliability 
of scores of “high tension groups” on the same 
examinations; (6) to examine the predictabil- 
ity of test scores for “high tension groups” 
and for “low tension groups.” 
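Objective (4) can be illustrated with a short computation. The following is a minimal sketch, assuming that "correlation coefficients" refers to the ordinary product-moment coefficient; the function name `pearson_r` and the five pairs of scores are hypothetical and are not data from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for five pupils (illustration only, not study data).
tension_scores = [21, 25, 30, 34, 40]
exam_scores = [88, 84, 79, 70, 65]
r = pearson_r(tension_scores, exam_scores)
print(f"r = {r:.2f}")
```

A coefficient near zero would indicate a lack of concomitant variation; the sign shows whether higher tension scores accompany higher or lower examination scores.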


THE SAMPLE 


All of the ninth grade pupils who were 
taking mathematics (four were not) in the 
University of Chicago High School were used 
as subjects. These pupils were in three mathe- 
matics classes: two classes having twenty-nine 
pupils and the other having twenty-two. 
Since the study continued for a complete 
semester, absences caused the total number 
available for different purposes to vary from 
seventy-five to eighty. The number of pupils 
involved is reported for each statistic. 

Since the University School is a laboratory 
school, the pupils would be expected to be 
more accustomed to experimentation and 
therefore less upset by it, than would be the 
pupils of the usual public school. However, 
to help assure a desirable attitude, participa- 
tion in the study was placed on a voluntary 
basis. The mathematics teachers explained to 
their pupils the general purpose of the study, 
“to investigate examination situations,” and 
told them that it would involve some extra 
time in laboratory experimentation and in 
answering questionnaires. The pupils were
then given the opportunity of deciding individually
whether or not they wished to participate.
All of the pupils offered cooperation,
though a few questioned the value before 
volunteering. 


THE EXAMINATIONS 


Each class was given four examinations: 
(1) a Semester examination which was the 
same for all three classes, given just before 
the end of the first semester in January, 1942; 
(2) an examination over a unit of work 
(three to four weeks of class work), which
was different for each of the three
classes; (3) a comparable form of this Unit 
examination on the day following the first 
Unit examination; (4) a Final examination 
over the entire year’s work—this Final exam- 
ination was the same for all three classes. 
These examinations will be referred to as T₁,
T₂, T₃, and T₄, respectively; and classes will
be designated as Class 1, Class 2, and Class 3. 
Other unit tests were given to the groups dur- 
ing the second semester, but it seemed wise 
to limit the experimental work to fewer than 
all of the situations so that the pupils would 
not anticipate certain procedures. 


The experimental examination situations 
were selected with regard for differences in 
importance, length, and difficulty. These 
three characteristics are objective enough to 
allow for agreement concerning differences, 
and they can be expected to produce differ- 
ences, in general, in magnitude of tensions. 


The Semester situation involved greater 
difficulty than the Unit situation in the sense 
that a larger amount of work was sampled in 
the Semester examination. It was of greater 
importance than the Unit examination in de- 
termining final grades. In length it was the 
same as the Unit examination, a fifty-minute 
period. The second Unit examination situa- 
tion corresponded to the first one in terms of 
length and difficulty (comparable form), but 
these two situations differed considerably in 
importance, since the pupils were informed 
that the second Unit examination would “not 
count on your grades.” The fourth situation, 
that of the Final examination, was thirty 
minutes longer than the Semester examina- 
tion; it sampled a bit more material; and it 
was considered, by the pupils, as important 
as the Semester in terms of grades. 
In order of decreasing amount, all three
characteristics were present in the four exam- 
ination situations as follows: Final, Semester, 
Unit 1, and Unit 2. This gives a basis for 
expecting a definite pattern in average ten- 
sions in the four situations. 


THE QUESTIONNAIRE 


A questionnaire was developed as the main 
technique for differentiating pupils in terms 
of examination tensions. It was of such a 
length that it could be taken in approximately 
eight minutes. This shortness of time was de- 
sirable, since the questionnaire could be given 
at the end of an examination period without 
seriously limiting the time devoted to the ex- 
amination. This questionnaire was admin- 
istered to the pupils in each class at the close 
of each of the four examination periods. The 
following excerpt gives the instructions at the 
top of the questionnaire page and a few 
example items (the entire questionnaire is 
presented at the end of this article):


Here are some statements concerning 
your feelings about today’s test. In each
statement you are to choose one of the 
three phrases marked a, b, or c. In each 
statement choose the one phrase (a, b, or 
c) which will make the sentence most 
nearly describe your feeling. Indicate your 
choice by making a circle around the letter
of that phrase. 

2. While taking the test I felt . . . a. very 
nervous b. somewhat nervous c. not 
at all nervous 


. For the amount of time we had to work 
on it, the test seemed to me to be. . . 
a. about the right length b. much too 
long c. too short 


. As soon as I began to work on the test 
I felt ... a. very calm b. not at all 
calm c. fairly calm 


In order to have valid items in the sense of 
using the words of the pupils to state condi- 
tions which they might have felt, the con- 
struction of the questionnaire was started by 
requesting a group of twenty-five pupils to 
write out statements of how they felt about 
taking mathematics examinations in general. 
They were told that some pupils had stated a 
preference between doing “home-work” prob- 
lems and doing “examination” problems, and 
that we should like to know what factors
entered into this preference. The resulting 
essays were analyzed for common statements 
of tension, and these statements were incor- 
porated in the first draft of the questionnaire. 
The wording which the pupils used was 
allowed to stand as nearly like they had 
written it as the formality of the question- 
naire would allow. 

This first draft of the questionnaire was 
then administered to two of the experimental 
classes following unit tests which they had 
taken. The pupils were asked to comment on 
the items in this draft and to add any state- 
ments which they felt were pertinent. The re- 
sults of these administrations were used to 
delete one item, change the form of a second 
item, and add two items. This revised form 
is the one used in the main experiment. All
construction was completed prior to the main
experiment.

In order to follow the lead suggested by 
Brown,³ this questionnaire is so worded
throughout that it applies specifically to the 
examination which the pupil has just taken, 
not to examinations in general. 
In order to answer each item on the questionnaire,⁴
the pupil is required to make a
choice among three possible answers. The
three answers under each item were each given
a value of 1, 2, or 3 points: 1 point for the
answer which is indicative of the least tension,
3 points for the answer which is indicative of
the greatest tension. The total score is the
sum of item scores.

For most items this discrimination of answer
values is an easy matter: one answer
clearly states high tension and another answer
is the opposite. For all items the discrimination
between “highest” and “lowest” answers
was made on the basis of the original essay
statements. The answers were so arranged
that no one value had the same position in
every item. In the completed questionnaires
there was no case of a pupil obtaining the
same value on every item, nor was there a
case of any pupil marking the same position
for every item. This is evidence which substantiates
the claim that the pupils responded
to the questionnaire items; they did not
simply “fill in spaces.”

As a check on the correctness of the weights
which were assigned to the answers, a measure
of internal consistency was computed for
each item. After the first administration an
item analysis was made of all the papers. The
total score on the questionnaire was used to
select three groups: (1) the twenty-two pupils
who made the highest scores, (2) the twenty
pupils who made the lowest scores, (3) the
seventeen pupils who made scores around the
average. A mean was found for each group on
each item. These means are presented in
Table I.


TABLE I

MEAN ON EACH ITEM OF THE QUESTIONNAIRE FOR HIGH, LOW, AND AVERAGE GROUPS
SELECTED ACCORDING TO TOTAL SCORE

Group (N): High (22), Average (17), Low (20)
[Item means are illegible in the source.]


³ C. H. Brown, “Emotional Reactions before Examinations:
II. Results of a Questionnaire,” and “III. Intercorrelations,”
Journal of Psychology, V (1938), 11-26, 27-31.

⁴ The pupils were requested to answer every item and
they did.


If an item were weighted incorrectly, the 
item means for the three groups (selected 
according to total score) would not bear the 
relation to each other of high, low, and 
medium for the corresponding groups. This 
correspondence is evident for each item in the 
Table. One would neither expect nor desire 
all these item means to show the same pro- 
portionate relationship, nor to show the rela- 
tionship of 3:2:1, since this would indicate 
that each item would differentiate the pupils 
as well as the total set differentiates them. 


It was felt that a more elegant statistic was 
not required here, since the intent was merely 
to ascertain whether or not any single item 
should have answer values placed in a differ- 
ent order. 
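The scoring rule and the weighting check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the item labels `nervous` and `calm` are hypothetical names for two of the excerpted items, their weights are inferred from the 1-to-3 rule rather than taken from a printed key, and the six answer sheets are invented.

```python
# Hypothetical weights for two excerpted items, inferred from the rule
# "1 point for least tension, 3 points for greatest tension".
WEIGHTS = {
    "nervous": {"a": 3, "b": 2, "c": 1},  # a. very, b. somewhat, c. not at all nervous
    "calm": {"a": 1, "b": 3, "c": 2},     # a. very, b. not at all, c. fairly calm
}


def total_score(responses, weights=WEIGHTS):
    """Total tension score: the sum of the item values for one pupil."""
    return sum(weights[item][choice] for item, choice in responses.items())


def item_means(papers, weights=WEIGHTS):
    """Mean item value for each item within one group of answer sheets."""
    return {
        item: sum(weights[item][paper[item]] for paper in papers) / len(papers)
        for item in weights
    }


def weights_look_consistent(high, average, low, weights=WEIGHTS):
    """For each item, report whether the group means fall in the order
    high > average > low, as expected when the weights are ordered correctly."""
    hi, av, lo = item_means(high), item_means(average), item_means(low)
    return {item: hi[item] > av[item] > lo[item] for item in weights}


# Invented answer sheets, already grouped by total score.
high_group = [{"nervous": "a", "calm": "b"}, {"nervous": "a", "calm": "c"}]
average_group = [{"nervous": "b", "calm": "c"}, {"nervous": "b", "calm": "c"}]
low_group = [{"nervous": "c", "calm": "a"}, {"nervous": "b", "calm": "a"}]

check = weights_look_consistent(high_group, average_group, low_group)
print(check)
```

An item reported as `False` would be a candidate for having its answer values placed in a different order, which is the only decision the check is meant to support.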
LURIA TECHNIQUE


One of the objectives of this investigation
was stated: to show how the results of the
questionnaire correspond to the results of a
technique which utilizes disturbances in
speech and motor reactions as a measure of
examination tensions. The technique used for
this is the one which has been described so
thoroughly by Luria⁵ that it is designated
with his name in the literature.

The apparatus has been described, pictured, 
and diagramed so frequently in readily avail- 
able literature, that it would be superfluous 
to repeat the entire process here. Luria’s 
description has been cited previously. Other 
descriptions are presented pictorially and 
diagrammatically by Gardner,⁶ who used
levers for the hand pressures and film for the
permanent record; by Langer,⁷ who describes
an ingenious and compact apparatus which
utilizes film for the record; by Olsen and
Jones,⁸ who used pneumatic systems and pen
records on paper; and by Sharp,⁹ whose
method is much the same as that of Olsen and 
Jones. 

In the present study, pens attached to an 
air system by means of a Marey tambour re- 
corded the hand pressures on six-inch paper, 
which was driven by a constant speed motor. 
Stimulus time and verbal response time were 
recorded on the same paper by means of a 
pen which was attached to an electromagnet 
and operated by the experimenter with a 
spring key. The experimenter would depress 
the key as he stated the stimulus word and, 
again, as the subject spoke the response. Each 
such depression of the key would cause a 
break in the stimulus-response record line. 
Time intervals—fifths of a second—were re- 
corded on the paper by a fourth pen, which 
was in series with a circuit breaker operating 
on a synchronous, constant speed motor. 

An example of the Luria record is shown in
Figure 1. Each record for each pupil consisted
of four lines: (1) a line which showed
time in fifth-second intervals; (2) a line
which showed, by means of breaks, the interval
between stimulus and verbal response;
(3) a line which indicated relative pressures
for the preferred hand (voluntary), hereafter
called the right-hand pressure line; (4)
a line which indicated relative variations in
pressure for the non-preferred hand (involuntary),
hereafter called the left-hand pressure
line. The pupils’ verbal responses were
recorded at the time of the experiment as a
matter of formality and in order to motivate
the individuals to respond properly. These
were not used in the subsequent analysis for
two reasons: (1) there was sufficient objective
evidence at hand without them to fulfill the
stated purpose of the technique, to show
correspondence between questionnaire and
speech and motor reactions; (2) there are no
objective criteria of “disturbed responses” for
the critical words used in this study.¹⁰

⁵ A. R. Luria, The Nature of Human Conflicts, pp. 24-27.
Translated by W. Horsley Gantt. New York: Liveright, Inc.

⁶ Gardner, “An Experimental Study of the Luria
Technique for Detecting Mental Conflict,” Journal of
Experimental Psychology, 495-506.

⁷ Langer, “The Tremograph,” Journal of General
Psychology, XV (1936).

⁸ Dorothy Olsen and ... Jones, “An Objective Measure
of Emotionally Toned Attitudes,” Pedagogical Seminary and
Journal of Genetic Psychology, (1931), 174-96.

⁹ Delia Larson Sharp, “Group and Individual Profiles in the
Association-Motor Test,” University of Iowa
Studies; Studies in Child Welfare, Vol. ..., No. 1. Iowa
City, Iowa: University of Iowa.


The Luria technique was used for each 
pupil twice. The first administration took 
place two weeks before the Semester examina- 
tion. At this time no mention had been made 
of the examination date nor of preparation 
for the examination. The Luria was given a 
second time just preceding the examination; 
namely, on the same day or on the preceding 
day.¹¹ These two administrations will be denoted
as L₁ and L₂, respectively.


The stimulus words used for L₁ were as
follows: 


neutral words: also, captain, cedar, clean,
deep, design, enjoy, exchange, follow, grocery,
house, nourish, nuisance, occasion,
path, prefer, prospect, ride, salute, satisfy,
table, tiger, true, unseen, watched;

critical words: arithmetic, equation, graph,
number, test;

post-critical¹² words: costume, cover, mountain,
purpose, umbrella.

¹⁰ Such criteria for older age levels may be found in: G. H.
Kent and A. J. Rosanoff, “A Study of Association in Insanity,”
American Journal of Insanity, LXVII (1910), 37-96,
317-90; for select words at younger age levels, see Herbert
Woodrow and Frances Lowell, “Children’s Association
Frequency Tables,” Psychological Monographs, Vol. XXII, No. 5.
Princeton: Psychological Review Co.

¹¹ A schedule was made whereby a different pupil would come to
the laboratory every ten minutes. Obviously, they could not
all be tested immediately preceding the examination, but it
was possible to arrange the schedule so that each could
be tested on the day of the examination or within the
twenty-four hours preceding that day. The three classes took
examinations on different days.

¹² Words which follow the critical words but which, otherwise,
would be neutral.

Fig. 1. A traced copy of a portion of a Luria record.


The stimulus words used for L₂ were as
follows: 


neutral words: city, cover, design, door, false,
grocery, nourish, nuisance, occasion, path,
prospect, ride, satisfy, umbrella, watched;

critical words: algebra, arithmetic, equation,
graph, number, problem, rectangle, sign,
solve, test;

post-critical words: also, clean, costume, exchange,
follow, long, prefer, purpose, salute,
tree.


The words were selected with careful con- 
sideration of two criteria: (1) that the mean- 
ing of neutral words should not be connected 
with mathematics nor examinations, but the 
critical words should be connected with these 
areas; (2) that the words should not present 
equivocal meanings. 

As may be seen, the meanings of the critical
words in each list are associated with “mathematics”
or with “examination.” The meanings
of neutral words are not so associated. Lack
of ambiguity for the neutral words was
checked by using only words which appear
in the list by Woodworth and Wells, for
which words they claim: “. . . so far as possible
[the words are] unambiguous.”¹³

On his first visit to the laboratory, the pupil 
was shown the apparatus and told that we 
wished to get a measure of his reaction time 
to words. He was seated before the hand 
tambours and made to feel as much at ease 
as the conditions of the experiment would 
permit. Next he was instructed in the proce- 
dure of responding to the stimulus words 
verbally and with hand pressure. He was cau- 
tioned not to rest his hands elsewhere (after 
the chair was adjusted so that his fingers could 
comfortably reach the tambours), and to keep 
his “left-hand” (non-preferred hand) steady 
on the tambour. He was requested to face 
forward and make sure that he was comfort- 
able before the experiment began. 

The experimenter sat to the right of the
subject and slightly behind him, in order that
he could control the signalizing key for stimulus
and response without distracting the subject’s
attention. A practice run of five or six
words (not from the regular list) was used
to see if directions were being followed. The
number of practice words was increased if
there was doubt as to instructions. This procedure,
which was allowed to remain informal,
was standardized in terms of specific steps.
With the exception of the introduction to the
apparatus, the procedure was repeated for the
second Luria.

¹³ Woodworth and Wells, “Association Tests,”
Psychological Monographs, Vol. ..., No. 5.
Princeton: Psychological Review Co., 1911.

Many features of Luria records have been 
used as indices of affective tension. Luria 
quantified only the speech reaction-time.¹⁴
Others have used various measurements of
length in the pressure curves.¹⁵ For the purpose
of the present investigation, only three
measures were used: (1) verbal reaction-time 
(VT); (2) height of right-hand response 
(HR); (3) height of left-hand response 
(HL). The first of these was estimated in 
tenths of a second by use of the time interval 
line; height measurements were made in 
millimeters. 

These three measurements were made for 
each of the thirty-five words on every record. 
Two sets of primary scores were tabulated for 
each individual—one for L,, the other for 
L,. Each set consisted of three scores for each 
word—VT, HR, and HL scores. The scores 
which were used as indices of tensions were 
derived from these primary measurements. 

The thirty-five VT scores, the thirty-five
HR scores, and the thirty-five HL scores were
summarized for each pupil on each Luria by
obtaining four statistics: (1) the standard
deviation for the thirty-five words ($\sigma$); (2)
the mean of the neutral words ($M_n$); (3) the
mean of the critical words ($M_c$); (4) the
mean of the post-critical words ($M_p$). Thus,
each pupil’s responses on each Luria were
represented by four numbers on VT, by four
on HR, and by four on HL. These means
may be considered as “best estimates” of a
pupil’s responses to the neutral words, to the
critical words, and to the post-critical words,
respectively.

The Luria technique utilizes differences in
response between neutral words and post-critical
words as well as between neutral words
and critical words. The two types of differences
are used on the basis that excess excitation
caused by the critical word may carry
over to the post-critical word, or, in some
cases, actually not appear until the post-critical
word is presented. In the present investigation
the larger of the two differences,
$M_c - M_n$ or $M_p - M_n$, was used in all cases.
These differences were divided by the standard
deviation ($\sigma$) of the responses for the total
word list, in order to make them abstract
numbers (independent of the unit of measurement).
These derived “tension scores” are
represented by D, and their derivation may
be shown by the statement:

$$D = \frac{M_c - M_n}{\sigma}, \quad \text{if } M_c \geq M_p,$$

or

$$D = \frac{M_p - M_n}{\sigma}, \quad \text{if } M_c < M_p.$$

¹⁴ Luria, . . . pp. 46-76.

¹⁵ One of the most complete lists of possible measurements
may be found in Reymert and Speer . . . p. 195.


By means of these formulas D-scores were 
derived for each pupil. By this final compu- 
tation each pupil was represented by three 
scores for each Luria. Since these D-scores 
are abstract numbers, the three for one pupil 
—VT, HR, and HL—may be combined. The 
advantage of this combined D-score is evident 
when one realizes that any tension which ex- 
ists may be evidenced in any one, but not 
necessarily all, of the three responses. These 
D-scores were used in showing relationship 
between the results of the questionnaire and 
the results of the Luria technique. 
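The derivation of a D-score for one measure (VT, HR, or HL) can be sketched as below. This is a minimal illustration, assuming that the standard deviation is taken over all words with a divisor of N (`statistics.pstdev`); the text does not state which divisor was used, and the function name `d_score` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def d_score(neutral, critical, post_critical):
    """Larger of (critical - neutral) and (post-critical - neutral) mean
    differences, divided by the standard deviation over the whole word list."""
    all_words = list(neutral) + list(critical) + list(post_critical)
    sigma = pstdev(all_words)  # assumption: divisor N, not N - 1
    if sigma == 0:
        raise ValueError("D is undefined when every response is identical")
    m_n = mean(neutral)
    m_c = mean(critical)
    m_p = mean(post_critical)
    # max(m_c, m_p) - m_n is the larger of the two differences.
    return (max(m_c, m_p) - m_n) / sigma


# Invented verbal reaction times in seconds (illustration only).
vt_neutral = [1.0, 1.0, 1.0, 1.0]
vt_critical = [3.0, 3.0]
vt_post_critical = [2.0, 2.0]
d_vt = d_score(vt_neutral, vt_critical, vt_post_critical)
print(round(d_vt, 3))
```

Because the difference is divided by a standard deviation in the same unit, multiplying every response by a constant leaves D unchanged, which is what allows VT (in seconds) and HR and HL (in millimeters) to be combined. The text does not say how the three D-scores were combined, so no combination rule is shown.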


RESPIRATION INDEX 


In order to demonstrate correspondence be- 
tween the results of the questionnaire and 
results of a respiration measure at examina- 
tion time, two groups of pupils—a high ten- 
sion group and a low tension group—were 
selected from the entire sample for use in the 
respiration experiment. There were seven 
boys and seven girls selected for each group. 
Since the respiration experiment was to be 
run before the Final examination, whereas the 
questionnaire was always presented at the 
close of an examination, the questionnaire 
which accompanied T₄ could not be used for
selection. The best available method of selec- 
tion of these groups consisted of results of 
previously administered questionnaires. 


Those in the “high” group had scored above 
the mean for the total group in each of three 
questionnaires: Q₁, the one which accompanied
the Semester examination; and Q₂ and
Q₃, which accompanied the Unit examinations.
Those in the “low” group had scored below 
the means for the entire group in each of these 
questionnaires. 
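The selection rule for the two groups can be sketched as follows. This is a minimal illustration: the function name `select_groups`, the pupil labels, and the scores are hypothetical, and the further step of choosing seven boys and seven girls for each group is not shown because the text does not state how that choice was made.

```python
from statistics import mean


def select_groups(scores):
    """scores maps each pupil to a tuple of three questionnaire totals.
    Returns (high, low): pupils above the group mean on all three
    questionnaires, and pupils below the group mean on all three."""
    n_forms = 3
    means = [mean(s[i] for s in scores.values()) for i in range(n_forms)]
    high = sorted(
        pupil for pupil, s in scores.items()
        if all(s[i] > means[i] for i in range(n_forms))
    )
    low = sorted(
        pupil for pupil, s in scores.items()
        if all(s[i] < means[i] for i in range(n_forms))
    )
    return high, low


# Invented questionnaire totals for four pupils (illustration only).
questionnaire_totals = {
    "A": (30, 28, 27),
    "B": (20, 19, 18),
    "C": (26, 22, 25),
    "D": (24, 27, 20),
}
high_group, low_group = select_groups(questionnaire_totals)
print(high_group, low_group)
```

Pupils who fall above the mean on some questionnaires and below it on others, such as C and D here, belong to neither group.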
Arrangements were made for these twenty-eight
pupils to come to the laboratory individually
just before the Final examination.
The length of time preceding the examination 
varied from one-half hour to twenty-four 
hours. The schedule was arranged so that the 
individuals of the “high” and “low” groups 
were treated the same in regard to the lapse 
of time. 


The technique for obtaining the respiration 
record was very conventional. A respiration 
bellows attached to the chest of the subject 
by means of a belt picked up the breathing 
movements. These changes in pressure were 
transmitted to a pen by means of an air sys- 
tem. The respiration record was an ink line 
which showed inhalations and exhalations by 
corresponding waves. 


Besides the respiration line itself, the record 
consisted of a line which measured time in- 
tervals and a line on which breaks signalized 
the beginning or the ending of the various 
types of stimulus periods used in the proce- 
dure. Rate of respiration could be measured 
by counting the number of wave crests (or 
troughs) appearing in conjunction with a 
given length of the time line. 
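The counting procedure can be illustrated with a short sketch. It assumes that the wave crests have already been read off the record as times in seconds along the time line; the function name `breaths_per_period` and the crest times are hypothetical, and the half-minute period follows the counting interval used in this study.

```python
def breaths_per_period(crest_times, start, period=30.0, n_periods=3):
    """Count the wave crests falling in consecutive periods of equal
    length, beginning at `start` seconds on the time line."""
    counts = []
    for k in range(n_periods):
        lower = start + k * period
        upper = lower + period
        counts.append(sum(1 for t in crest_times if lower <= t < upper))
    return counts


# Invented record: one crest every 3 seconds for two minutes.
crest_times = [i * 3.0 for i in range(40)]

# Discard the first half minute, then count three half-minute periods.
counts = breaths_per_period(crest_times, start=30.0)
print(counts)
```

Doubling a half-minute count gives the rate per minute for that period.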

As each pupil entered the laboratory, the 
apparatus was shown to him, and its general 
function was described. After he was seated 
in a desk-arm chair facing the experimenter, 
the respiration belt was fastened about his 
chest. Care was taken to see that the subject 
was comfortable and that the belt was prop- 
erly adjusted. A short record was run to 
assure the proper functioning of all apparatus. 
The pupil was requested to sit up straight 
and to remain quiet. This admonition was 
repeated when necessary during the experi- 
ment. The experimental period was compara- 
tively short, and the pupils cooperated well. 


The experimental period was divided into
six main parts: (1) an initial rest period 
which lasted almost two minutes (so that the 
first part could be discarded and three half- 
minute periods would remain); (2) and (3) 
a questioning period, which was later divided 
into two parts (each slightly over two minutes 
in length); (4) and (5) two one-minute 
periods during which the pupil worked problems
on an addition test; and (6) a final rest
period identical with the initial rest period in 
time and directions given. 
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For the first period the pupil was directed 
to “relax and think about as near nothing as 
you can.” Time was allowed for the breathing 
movements to become more or less regular 
before the record was started. For parts (2) 
and (3) the pupil was asked the following 
questions: Do you know whether this coming 
examination will count more in terms of 
grades or about the same in terms of 
grades as do most of the mathematics tests 
that you have taken through the year? Do 
you feel more or less worried about this 
mathematics examination (the Final examina- 
tion) than you do about other subject matter 
tests in general? Can you think of any type 
of material on this coming examination which 
might worry you more than other types? If 
so, what material? What statement could you 
make about how well prepared you feel that 
you are for this math examination? Knowing 
what grades you have made in general on 
unit tests in mathematics, what grade (A, B, 
C, D, F) would you believe or guess that you 
may make on this coming examination? 


These questions were intended as excitation 
stimuli for those who would express examina- 
tion tensions. Following the questions, two 
periods, (4) and (5), were devoted to work- 
ing an addition test, copies of which were 
attached to the desk-arm of the chair so that 
the pupil would only have to use his writing 
arm. This test had two psychological connections
with the coming mathematics examination:
it had been used at the time of two previous
examinations; and it was arithmetic
material. At the end of the fifth period the 
pupil was instructed to relax “once more—as 
you did so well at the first part of the experi- 
ment.” The record of this relaxation period 
constituted the sixth period. Each of these 
six periods was marked on the record by 
means of the signalizing pen, which was oper- 
ated by the experimenter with a spring key. 


Respirations were counted in half-minute
intervals for each of the six experimental
periods. These counts were averaged for each
pupil for each period. The averages, converted
into respirations per minute (a more
conventional unit for reporting than the
half-minute unit), were used as scores. In this
way, the final records show one score for each
pupil on each of the experimental periods.


QUESTIONNAIRE RELIABILITY
AND VALIDITY


An “odd item—even item” division was 
used for the split-half method of estimating 
reliability. Reliability coefficients were ob- 
tained for each of five subgroups and for the 
total number of pupils (seventy-seven pupils 
completed the questionnaire in each of the 
four administrations). The five subgroups 
were: (1) class 1, twenty-eight pupils; (2) 
class 2, twenty-two pupils; (3) class 3, 
twenty-seven pupils; (4) all of the boys, 
forty pupils; (5) all of the girls, thirty-seven 
pupils. Obviously, the membership of the last 
two groups overlaps that of the other three. 
These subgroups were used in order to deter- 
mine whether any one class or sex presented 
a notably different pattern of reliability co- 
efficients than did the total group of pupils. 

As stated previously, the questionnaire was 
administered to each group four different 
times. A further check on the reliability of an 
instrument is the consistency of the magni- 
tude of reliability coefficients under differing 
circumstances. Twenty-four reliability coeffi- 
cients were computed for the questionnaire; 
one for each administration for each of the 
six groups. These coefficients were substituted
in the Spearman-Brown formula for estimating
the reliability of a whole test from the
reliability of half of the test. These corrected
coefficients are presented in Table II, together
with the corresponding standard deviation for
each group.
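
The odd-even procedure and the correction for length may be sketched as follows. This is a minimal illustration, not the computation actually carried out in the study: the function names and the small set of item responses are hypothetical, and only the Spearman-Brown step-up formula for a test of doubled length, 2r / (1 + r), is taken as given.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate whole-test reliability from the correlation between halves."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(responses):
    """responses: one list of item scores per pupil, items in test order."""
    odd_totals = [sum(items[0::2]) for items in responses]   # items 1, 3, 5, ...
    even_totals = [sum(items[1::2]) for items in responses]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))


# Hypothetical responses of four pupils to a six-item questionnaire.
demo = [
    [1, 2, 1, 2, 1, 1],
    [3, 3, 2, 3, 3, 2],
    [2, 1, 2, 2, 1, 2],
    [3, 2, 3, 3, 3, 3],
]

print(round(spearman_brown(0.60), 2))  # a half-test r of .60 steps up to 0.75
print(round(split_half_reliability(demo), 2))
```

The correction always raises a positive half-test correlation, which is why the coefficients in Table II are described as "corrected for length."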

With the exception of four of them, all of
the coefficients exceed .70; six of them are
.80 or above; and the median value for the
twenty-four coefficients is .76. For the Total
Group, the coefficient has a range of only .74
to .80. No one group has the highest or
lowest correlation in all situations, and for
no one situation are the coefficients the highest
or the lowest for all groups. In other words,
there is no trend in terms of situations or
groups. There is consistency in the magnitude
of the coefficient for the twenty-four estimates.
The median value of the reliability
coefficients is .76.

The following statistics are presented as an 
aid in the interpretation of the magnitude of 
this coefficient: the standard errors of re- 
sponse for the individual** for the four situa- 
tions (Semester, Unit 1, Unit 2, and Final) 
are 3.0, 3.2, 2.3, and 2.8, respectively; the 
corresponding score ranges are 20 to 44, 18 
to 46, 16 to 42, and 22 to 46. An individual 
response error of approximately 3.0 for a 
sixteen-item questionnaire which shows a 
range in scores of approximately 25 points is 
small enough to justify a claim of consistency 
of measurement. 
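
The standard errors of response quoted above can be checked with a short sketch. It assumes that the standard error of response is the standard deviation multiplied by the square root of one minus the reliability, and it uses the total-group standard deviations and reliabilities as they can be read from Table II; both the function name and those readings are assumptions of this illustration.

```python
from math import sqrt


def standard_error_of_response(sd, reliability):
    """Standard deviation times the square root of (1 - reliability)."""
    return sd * sqrt(1.0 - reliability)


# Total-group values (standard deviation, reliability) as read from
# Table II; treated here as assumed readings of a damaged table.
total_group = {
    "Semester": (6.81, 0.80),
    "Unit 1": (6.24, 0.74),
    "Unit 2": (4.68, 0.76),
    "Final": (5.65, 0.75),
}

for situation, (sd, r) in total_group.items():
    print(situation, round(standard_error_of_response(sd, r), 1))
```

Under these assumptions the four results round to 3.0, 3.2, 2.3, and 2.8, the values given in the text.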

The foregoing data substantiate the state- 
ment that the questionnaire is sufficiently 
reliable that its scores may be used to differ- 
entiate groups of pupils. Furthermore, the 
reliability coefficients for six groups in four 
different situations show a consistency of 
magnitude which supports use of the ques- 
tionnaire under varying conditions. 


VALIDITY 


Certain fundamental characteristics of
validity are present in the questionnaire: (1)
the items in the questionnaire pertain to the
characteristics with which the investigation is
concerned; (2) the examinees understood the
phraseology of items in the questionnaire;
(3) the questionnaire results do meet the
common-sense expectation of significant relationships
(correlations different from zero)
between tension rankings of pupils in different
examination situations. The six intercorrelations
for the four administrations of the
questionnaire are presented in Table III, together
with the reliabilities for the total group
of seventy-seven pupils. The intercorrelations
are all positive, and each exceeds the
minimum value necessary for significance at
the 1 per cent level of confidence.†


** The standard error of response is the standard deviation multiplied by the square root of the
quantity one minus the reliability of measurement. The values given are for the
total group.


TABLE II


RELIABILITY COEFFICIENT (r),* STANDARD DEVIATION (σ), AND NUMBER OF CASES (N) FOR EACH
OF FIVE SUBGROUPS AND THE TOTAL GROUP ON EACH OF FOUR
ADMINISTRATIONS OF THE QUESTIONNAIRE


Examination    Class 1       Class 2       Class 3       Boys          Girls         Total
Situation      N = 28        N = 22        N = 27        N = 40        N = 37        N = 77
               r     σ       r     σ       r     σ       r     σ       r     σ       r     σ

Semester       .78   7.18    ...   ...     ...   ...     ...   ...     ...   ...     .80   6.81
Unit 1         .76   6.19    ...   ...     ...   ...     ...   ...     ...   ...     .74   6.24
Unit 2         .56   3.10    ...   ...     ...   ...     ...   ...     ...   ...     .76   4.68
Final          .61   5.64    ...   ...     ...   ...     ...   ...     ...   ...     .75   5.65


* By the “split-half” method; corrected for length.


TABLE III


INTERCORRELATIONS AND RELIABILITY COEFFICIENTS*
FROM FOUR ADMINISTRATIONS
OF THE QUESTIONNAIRE TO
SEVENTY-SEVEN PUPILS


Examination              Examination Situation
Situation        Semester    Unit 1    Unit 2    Final

Semester         .80         ...       ...       ...
Unit 1                       ...       ...       ...
Unit 2                                 ...       ...
Final                                            ...


* From Table II, inserted here for comparison
with intercorrelations.


Questionnaire-score differences between ex- 
amination situations corresponded to differ- 
ences in objective characteristics of these 
situations. More particularly, when the de- 
gree of difficulty, the length, and the impor- 
tance to the pupil increased or decreased from 
one examination to another, the questionnaire 
scores for those examinations increased or 
decreased, respectively. 

Table IV shows the questionnaire means for
seventy-seven pupils for each of the four
examination situations. The standard deviation‡
and standard error of the mean are
presented for each situation in the same
table. It may be seen by inspection that the
four means for the total group indicate
that average tension (as measured by the
questionnaire) varies from situation to
situation. In order of decreasing magnitude
of tension, the situations are ranked as follows:
Final, Semester, Unit 1, Unit 2. This
order is the same as that suggested by ranking
the examination situations on the basis of the
characteristics of length, difficulty, and importance.
The differences in means are statistically
significant.


TABLE IV


QUESTIONNAIRE MEAN, STANDARD DEVIATION,
AND STANDARD ERROR OF MEAN FROM FOUR
EXAMINATION SITUATIONS FOR
SEVENTY-SEVEN PUPILS


Examination               Standard     Standard Error
Situation       Mean      Deviation    of Mean

Semester        ...       6.81         0.78
Unit 1          ...       6.24         0.71
Unit 2          ...       4.68         0.53
Final           ...       5.65         0.64


† E. F. Lindquist, Statistical Analysis in Educational Research,
pp. 210-12. Boston: Houghton Mifflin Co., 1940. In
Table 13, “Values of Correlation Coefficients Required for
Significance at the 5 Per Cent and 1 Per Cent Levels for
Samples of Various Sizes,” the minimum value at the 1 per
cent level for 75 cases is .296. Since the value varies inversely
with the size of the sample, the minimum value for 77 cases
would be less than .296. The intercorrelations for the questionnaire
exceed this value.

‡ Since the classes each consisted of fewer than thirty
pupils, (N - 1) instead of N was used in computing standard
deviations for these groups. For a discussion of this, see
Lindquist, op. cit., pp. 49-50. In order to be consistent, this
formula was used for all standard deviations, even though its
effect with larger samples is probably negligible.
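
The standard errors of the mean in Table IV can be reproduced from the standard deviations and the number of pupils. The sketch below assumes the usual relation, standard deviation divided by the square root of the number of cases; the function name is illustrative, and the standard deviations are the total-group values reported for the four situations.

```python
from math import sqrt


def standard_error_of_mean(sd, n):
    """Standard deviation divided by the square root of the number of cases."""
    return sd / sqrt(n)


N_PUPILS = 77
standard_deviations = {
    "Semester": 6.81,
    "Unit 1": 6.24,
    "Unit 2": 4.68,
    "Final": 5.65,
}

for situation, sd in standard_deviations.items():
    print(situation, round(standard_error_of_mean(sd, N_PUPILS), 2))
```

The results round to 0.78, 0.71, 0.53, and 0.64, in agreement with the tabled values.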

For further validation of the questionnaire 
it was proposed to show the correspondence 
between the results of the questionnaire and 
the results of a technique which utilizes disturbances
in speech and motor reactions as an
index of tension. The Luria technique, which 
was chosen for this demonstration, has been 
described. 

There were seventy-three pupils who com- 
pleted both Lurias and the questionnaire at 
the time of the Semester examination. In 
order to make a comparison between ques- 
tionnaire results and Luria results, these 
pupils were divided into two groups, a high 
tension group and a low tension group, 
according to the scores which they made on 
the questionnaire at the time of the Semester 
examination. The mean questionnaire score 
for the total group was 31.7. Of the seventy- 
three who completed the Lurias and the ques- 
tionnaire there were thirty-seven who scored 
32 or above on the questionnaire. These com- 
pose the high tension group. The other 
thirty-six pupils, who scored 31 or below on 
the questionnaire, form the low tension group. 

If the questionnaire results correspond to
the results of the Luria, the scores on the
second Luria must show that these two groups
differ significantly. Table V presents the mean
D-score (combined measure) for each group
on that Luria which was taken immediately
before the examination. This combined D-score
is the most appropriate measure to compare
with questionnaire results, since tension
as measured by the Luria may be expressed
in any one, but not necessarily all three, of
the indices (VT, HR, and HL). It can be
seen that the high tension group, according
to the questionnaire, scored higher on the
Luria than the low tension group and that
this difference is significant at the 9 per cent
level of confidence.


TABLE V


MEANS AND DIFFERENCES BETWEEN MEANS FOR
HIGH AND FOR LOW TENSION GROUPS (BY
QUESTIONNAIRE) ON THE COMBINED
D-SCORE OF THE SECOND LURIA


                                Tension Group
                                High      Low

Difference of Means             ...
Standard Error of Difference    ...
t-value                         ...
Level of Confidence             9%


The mean D-scores for the high tension
group and for the low tension group on verbal
reaction-time, right-hand response, and left-hand
response are shown in Table VI.
Although these separate indices are not considered
to be as good measures of tension as
the combined score, the differences between
means of the high and low groups are in the
desired direction with two of the indices, VT
and HL. In one case only, the right-hand response
score, does the high tension group have
a lower score on the Luria than the low tension
group. This is the only case, also, in
which the difference is clearly insignificant,
being a difference which would be expected
by chance in approximately half of all such
samples. The high tension group, as selected
by the questionnaire, has a higher mean score
than the low tension group on both verbal
reaction-time and left-hand response.


TABLE VI


MEANS AND DIFFERENCES BETWEEN MEANS FOR HIGH AND FOR LOW TENSION GROUPS
(BY QUESTIONNAIRE) ON THREE INDICES* OF THE SECOND LURIA


                                      VT                HR                HL
                                Tension Group     Tension Group     Tension Group
                                High     Low      High     Low      High     Low

Number of Cases                 37       36       37       36       37       36
Mean                            ...      0.08     0.15     0.22     0.16     -0.01
Standard Deviation              ...      0.47     0.49     0.49     0.60     0.47
Standard Error of Mean          ...      0.08     0.08     0.08     0.10     ...

Difference of Means                 ...              -0.07              0.17
Standard Error of Difference        ...               0.11              0.13
t-value                             ...               0.64              1.31
Level of Confidence                 ...               52%               19%


* VT, verbal reaction-time; HR, height of right-hand response line; HL, height of left-hand response
line.


It was felt that further evidence on the
validity of the questionnaire might be obtained
by computing the mean of the difference
between L₁ scores and L₂ scores for the
high and low tension groups as selected by
the questionnaire. Although, as explained,
there was a general drop in tension between
L₁ and L₂, one would expect the low tension
group to show a greater drop than the high
tension group, if the questionnaire is measuring
examination tensions. The combined D-score
on each Luria was used for the computation
of mean difference. The mean difference
for the high tension group and the mean
difference for the low tension group are shown
in Table VII. The order of subtraction is such
that a negative number means a drop in tension
from L₁ to L₂. Both groups show a drop,
but the low tension group shows a greater
drop than the high tension group, and the
difference between them is significant at the
8 per cent level of confidence. As a matter of
fact, the standard error of the mean for each
group indicates that the mean of the difference
between L₁ score and L₂ score for the
high tension group may not be different from
zero, whereas this mean of the difference for
the low tension group is definitely negative.
In other words, the low tension group, as
selected by the questionnaire, shows a greater
drop in tension from L₁ to L₂ than does the
high tension group.


TABLE VII


MEAN DIFFERENCE BETWEEN COMBINED D-SCORES
ON L₁ AND L₂, AND
DIFFERENCE BETWEEN MEANS, FOR
HIGH AND FOR LOW TENSION
GROUPS (BY QUESTIONNAIRE)


                                Tension Group
                                High      Low

Difference of Means             ...
Standard Error of Difference    ...
t-value                         1.75
Level of Confidence             8%

The results of the questionnaire corre- 
sponded to the results of a technique which 
utilized disturbances in speech and motor re- 
actions as an index of tension. The groups 
selected as high and low in tension according 
to the questionnaire showed the same order 
in terms of their mean score on the Luria 
method. 
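
The comparisons between high and low tension groups rest on the ratio of a difference between means to the standard error of that difference. The sketch below shows the arithmetic for two independent groups, assuming the standard error of the difference is the square root of the sum of the squared standard errors of the two means. The function name is illustrative, and the figures are the right-hand response (HR) values as they can be read from Table VI.

```python
from math import sqrt


def t_for_difference(mean_high, mean_low, se_high, se_low):
    """Return (difference, standard error of difference, t-value)
    for two independent groups."""
    difference = mean_high - mean_low
    se_difference = sqrt(se_high ** 2 + se_low ** 2)
    return difference, se_difference, difference / se_difference


# HR index: means 0.15 (high) and 0.22 (low); standard errors of the
# mean 0.08 for each group. These are assumed readings of Table VI.
difference, se_difference, t_value = t_for_difference(0.15, 0.22, 0.08, 0.08)

print(round(difference, 2))
print(round(se_difference, 2))
print(round(abs(t_value), 2))
```

With unrounded intermediate values the ratio comes out slightly below the tabled 0.64, which appears to have been computed from the rounded standard error of 0.11.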

It was proposed to show correspondence 
between the results of the questionnaire and 
a respiration index used as a measure of ten- 
sions. The apparatus, the experimental setup, 
and the scoring of the respiration records were 
described in a foregoing section. 

If this respiration measure is an index of
tension, the number of respirations per minute
during the middle four “disturbance” periods
(questions and addition) should be greater
than the number during the two rest periods.
To determine whether this was so, the
respiration scores for the two rest periods
were averaged for each pupil, and the scores
for the other four periods (disturbance
periods) were averaged for each pupil. These
averages are given in Table VIII. The average
number of respirations per minute is larger
for each of the twenty-seven pupils in the
disturbance periods of questioning and addition
than it is in the rest periods.
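
The scoring of a single respiration record may be sketched as follows: half-minute counts are averaged within each period and doubled to give respirations per minute, after which the two rest periods and the four disturbance periods are averaged separately. The function names and the counts for the one pupil shown are hypothetical; they are not taken from the records of the study.

```python
def respirations_per_minute(half_minute_counts):
    """Average the half-minute counts for one period and convert to a
    per-minute rate."""
    mean_count = sum(half_minute_counts) / len(half_minute_counts)
    return 2 * mean_count


def rest_and_disturbance_averages(period_scores):
    """period_scores: six per-minute scores in period order (1) to (6).
    Periods (1) and (6) are rest; periods (2) to (5) are disturbance."""
    rest = (period_scores[0] + period_scores[5]) / 2
    disturbance = sum(period_scores[1:5]) / 4
    return rest, disturbance


# Hypothetical half-minute counts for one pupil, by period.
counts_by_period = [
    [9, 9, 9],         # (1) initial rest
    [10, 11, 10, 11],  # (2) questioning, first part
    [11, 11, 10, 10],  # (3) questioning, second part
    [11, 12],          # (4) addition test, first minute
    [12, 12],          # (5) addition test, second minute
    [9, 8, 10],        # (6) final rest
]

scores = [respirations_per_minute(counts) for counts in counts_by_period]
rest, disturbance = rest_and_disturbance_averages(scores)
print(scores)
print(rest, disturbance)
```

For this invented pupil the disturbance average exceeds the rest average, which is the pattern reported for every pupil in Table VIII.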

The average respirations per minute during 
the rest periods for the high tension group 
(questionnaire) and the same statistic for the 
low tension group are shown in Table IX, together
with the difference between the means
and the t-value of this difference. Since this
t-value (for twenty-five degrees of freedom) 
indicates a level of confidence between 70 
per cent and 80 per cent, it can be stated that 
the two groups, high and low, did not differ 
in average number of respirations per minute 
during the rest periods. The two groups were 
alike on this respiration measure at the time 
of the rest periods. 

Table X presents the same statistics for 
the two groups as those in the preceding 
Table, except that the disturbance periods of 
questioning and addition were used for the 
means in Table X. The two groups, as 
selected by the questionnaire results, differed 
in terms of respirations per minute during 
the disturbance periods of the experiment. 
This is especially significant, since the two
groups were shown to be alike in the rest
periods.


TABLE VIII


AVERAGE NUMBER OF RESPIRATIONS PER MINUTE FOR THE TWO PERIODS OF REST AND FOR THE
FOUR PERIODS OF DISTURBANCE (QUESTIONS AND ADDITION) FOR EACH PUPIL


High Tension Group*

Rest         Disturbance
Periods      Periods

...          ...


Low Tension Group*

Rest         Disturbance
Periods      Periods

18.0         ...
11.5         ...
18.0         ...
15.0         ...
17.0         ...
14.0         ...
14.5         ...
16.0         ...
20.5         ...
12.5         ...
15.0         ...
22.5         ...
19.0         ...


* Selected by questionnaire scores.


The results of the questionnaire corre- 
sponded to the results of a respiration measure 
of tension in that groups selected as high and 
low in tension according to the questionnaire 
showed the same order in terms of their mean 
scores on the respiration measure. Although 
the groups did not differ significantly when 
the measure was applied during a rest period, 
they did differ significantly in the direction 
which demonstrated correspondence between 
questionnaire and respiration measures, when 
the respiration measure was applied during 
periods which should incite examination ten- 
sions. 


TABLE IX


MEAN, STANDARD DEVIATION, AND NUMBER OF
CASES FOR HIGH AND FOR LOW TENSION
GROUPS* ON RESPIRATIONS PER MINUTE
DURING REST PERIOD; DIFFERENCE
BETWEEN MEANS, t-VALUE, AND
LEVEL OF CONFIDENCE FOR THE
TWO GROUPS


Tension Groups
High      Low


* Selected by questionnaire scores.


TABLE X


MEAN, STANDARD DEVIATION, AND NUMBER OF
CASES FOR HIGH AND FOR LOW TENSION
GROUPS* ON RESPIRATIONS PER MINUTE
DURING DISTURBANCE PERIODS†; DIFFERENCE
BETWEEN MEANS, t-VALUE,
AND LEVEL OF CONFIDENCE FOR
THE TWO GROUPS


Tension Groups
High      Low


* Selected by questionnaire scores.

† Two periods of questioning and two periods
of addition.


TENSION INTENSITY AND
EXAMINATION RESULTS


The second objective of this investigation 
was “to search for relationships which tension 
scores may bear to the examination results.” 
The validity of the questionnaire which was 
developed has been demonstrated, and it was 
shown to have a reliability which is suffi- 
ciently high to justify group selection. Con- 
sequently, high tension and low tension groups 
were selected by means of questionnaire 
scores for this study of relationships between 
tension intensity and examination results. 


The following areas of possible relation- 
ships were selected for study by the methods 
indicated: (1) concomitant variation by the 
use of correlations; (2) consistency or vari- 
ability of examination results for high tension 
and low tension groups by means of reliabil- 
ity coefficients for the examinations; (3) the 
predictability of examination results for high 
tension and low tension groups by comparing 
the indices of value of prediction for such 
groups and by obtaining differences between 
predicted scores and obtained scores on ex- 
aminations for such groups. It is not claimed 
that this is an exhaustive list of all possible 
relationships and methods of study, but it is 
claimed that the areas described are impor- 
tant in terms of use of classroom examination 
results. 


The correlation coefficients between tension 
scores and examination scores are presented 
in Table XI. It appears that concomitant 
variation between tensions and examination 
scores tends to be inverse, although high ten- 
sions are certainly not confined to those mak- 
ing low marks on the examination. Only seven 
of the eighteen coefficients are larger than the 
magnitude required for significance at the 5 
per cent level; the meaning of these is rather 
lost in the inconsistency with which they 
appear for any one group or for any one 
examination situation. 

Of course, the correlation coefficients give 
no sure basis for an interpretation of causal 
relationship; they merely indicate the degree 
of association between the two variables. 
However, for those who investigate the causa- 
tion of tensions, the main implication of the 
foregoing findings is that other factors than 
lack of ability on the examination must enter 
into such causation. Studies directed toward
the areas of “ego involvement” and “level of
aspiration” might help answer the question as
to why these tension-examination correlations
tend to vary so much. On the other hand, for
those who infer causation in the other direction
(higher tensions resulting in lower scores),
the findings indicate that this might be true
only in certain situations, not in all situations.


TABLE XI


CORRELATION COEFFICIENTS BETWEEN EXAMINATION SCORES AND QUESTIONNAIRE SCORES FOR
VARIOUS GROUPS IN EACH OF THE FOUR EXAMINATION SITUATIONS


Group                 Examination Situation

Class 1 (N = 28)      -.40   ...   .05   -.06   .37
Class 2 (N = 22)      -.38   -.25   -.52
Class 3 (N = 27)      -.26
Girls (N = 37)
Boys (N = 40)
                      51   42


* Since the Unit examinations were different for each class, correlations could not be computed for
three groups.
† Correlation coefficients required for significance at the 5% level for the size sample used. (See E. F.
Lindquist, Statistical Analysis in Educational Research, p. 212. Boston: Houghton Mifflin Co., 1940.)


It was hypothesized that the examination
product of students with high tensions would
be less reliable than that of those having
lower tensions. Split-half reliability coefficients
were computed on each examination
for a high tension group and for a low tension
group. The tension groups were selected by
means of the questionnaire results. For the
Semester and for the Final examination, each
of which was the same for all pupils, the
group of pupils scoring at or near the mean
on the questionnaire was excluded in order to
give more meaning to the phrases “high tension”
and “low tension.” In the Unit 1 and
Unit 2 situations each class took a different
examination. Since each class consisted of
fewer than thirty pupils, it seemed advisable
to select “high” and “low” tension groups on
the questionnaire by splitting the distribution
at the median; that is, because of the small
number of cases, the “average” group was not
excluded on these two examinations. The reliability
coefficients for all groups are presented
in Tables XII and XIII.

Differences between reliability coefficients 
on examinations for the high tension and low 
tension groups were all too small to be sta- 
tistically significant, but there was a high 
degree of consistency in the direction of the 
differences; namely, the high tension group 
presented a lower reliability coefficient than 
did the low tension group in eight of the ten 
possible comparisons. It should be noted that 
the various comparisons in reliability coeffi- 
cients are not made with the same pupils from 


TABLE XII 


RELIABILITY COEFFICIENTS* AND STANDARD DEVIATIONS ON SEMESTER AND FINAL EXAMINATIONS
FOR HIGH AND FOR LOW TENSION GROUPS WITH GIVEN MINIMUM AND MAXIMUM
SCORES ON QUESTIONNAIRE


High 


. 98 
35 or above 
28 


* Split-half method; corrected for length. 


Tension Group
Low High Low 


Semester Examination 


. 93 . 93 
28 or below 37 or above ~ 
26 22 


.94 
27 or below 
22 
11.6 10.5 11.0 
Final Examination 
. 98 
30 or below 
22 


16.8 


. 98 
31 or below 
24 
17.0 


. 96 
38 or above 
20 
14.1


TABLE XIII


RELIABILITY COEFFICIENTS* AND STANDARD DEVIATIONS FOR HIGH AND FOR LOW TENSION GROUPS†
OF EACH CLASS ON UNIT 1 AND UNIT 2 EXAMINATIONS


Class 2 Class 3


Low


* Split-half method; corrected for length.
† Above and below the median, respectively.


one comparison to another. The consistency 
of direction of difference does not result from 
holding the membership of the groups con- 
stant. The present findings point to the need 
for an investigation which would include 
enough cases so that small differences in coefficients
of this magnitude could be considered
significant.*

It should be pointed out, however, that the 
technique used in the present investigation 
for estimating the relative variability of the 
two groups is crude. The split-half reliability 
coefficient depends so largely upon the num- 
ber of items that variability of the individual’s 
response is a relatively minor factor, unless 
it is known that corresponding items in the 
two halves of the test are equivalent in 
difficulty. 

The method of split-half reliability was 
selected for the present study because it could 
be used with the experimental material which 
was constructed as regular classroom achieve- 
ment test material. To have used more re- 
fined methods for estimating variability would 
have necessitated materials—series of sub- 
tests consisting of items of equivalent diffi- 
culty—which would have been in opposition 
to the requirement of regular achievement 
examinations. The present investigation pre- 
sents a clue for further work with more 
refined methods of estimating variability of 
individual response. 


* An original sample (before the “middle” group were excluded)
of over 400 cases would be needed. In order to get a
sufficient separation of high and low tension groups, it would
be well to use only the upper and lower fourths (on tension
scores) of such a group. If the anticipated correlations were,
for example, .95 and .97, the number of cases necessary for
the difference of .02 to be significant at the 5 per cent level
... hundred in each sample. For a
method of predicting the number of cases necessary see
Lindquist, op. cit., pp. 216-17.


High Low 
Unit 1 Examination 
92 . 95 


orl ll 14 
6.7 6.6 


Unit 2 Examination 
. 89 . 98 
1l 11 
4.5 5.2 


High 


TENSIONS AND PREDICTIONS OF 
EXAMINATION SCORES 


In the statement of objectives of this inves- 
tigation, item (6) proposed: “to examine the 
predictability of test scores for ‘high tension 
groups’ and for ‘low tension groups’.” This 
objective is based on the hypothesis that the 
scores of those who have high tensions at the 
time of an examination may be less predict- 
able than the scores of those who show less 
tension, when the prediction in each case is 
based upon the same factors. Evidence con- 
cerning the relationship between intensity of 
tension and predictability of examination re- 
sults should be of great importance in terms 
of use of examination results. 

Two different predictions are discussed in 
the following order: (1) the prediction of 
Final examination scores from Semester 
scores; (2) the prediction of Final examina- 
tion scores by means of a multiple regression 
using American Council Psychological Exam- 
ination scores, Reavis—Breslich Arithmetic 
Test scores, the Semester examination scores, 
and, of course, the Final examination scores. 

Because the Semester examination and the 
Final examination are of similar difficulty and 
importance and differ mostly in amount of 
material sampled, one should expect fairly 
high predictive value from the former to the 
latter. The interest of the present investiga- 
tion, however, is not in the absolute value of 
the prediction. It is in the difference of pre- 
dictability between those with high tensions 
and those with low tensions. 

As in the previous parts of this investigation,
high and low tension groups were
selected on the basis of scores on the questionnaire.
In this instance an average score
for two administrations of the questionnaire
was used. The two administrations were those 
in the Semester situation and in the Final 
situation, since it seems reasonable to suppose 
that prediction might be affected by tensions 
at either time. Three sets of high and low 
tension groups were used: (1) all of those 
who scored above the mean of the entire 
group, thirty-nine cases, and all of those who 
scored below this mean, thirty-eight cases; 
(2) those who scored 36 or above, twenty- 
nine cases, and those who scored 30 or below, 
thirty cases; (3) those who scored 37 or 
above, twenty-one cases, and those who scored 
28 or below, twenty-two cases. Sets (2) and 
(3) were used in order to see if widening the 
gap between high and low groups on the ten- 
sion scale would affect the differences in pre- 
diction. 


For each of these groups—three high and 
three low in tension scores—three statistics 
are presented in Table XIV: (1) the correla- 
tion between Semester examination scores and 
Final examination scores; (2) the standard 
deviation for the group on the Final exam- 
ination; (3) the standard error of estimate 
in predicting Final from Semester. The first 
of these shows the degree of association be- 
tween the two variables. The second, together 
with the first, is used in obtaining the third, 
which is a measure of the value of the pre- 
diction: the higher the standard error of esti- 
mate of a prediction, the poorer is the pre- 
diction. Groups to be compared have the same 
group identification number in the Table; 
namely, high tension group 1 should be compared
with low tension group 1, etc. The data
in the Table have been so arranged that the
size of the groups decreases as one reads up
or down from the center of the Table, just as
the size of the extreme group decreases as one
moves up or down from the center of the
tension scale.
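
The relation among the three statistics can be illustrated with a short sketch. It assumes the usual formula for the standard error of estimate, the standard deviation of the predicted variable multiplied by the square root of one minus the squared correlation. The function name is illustrative; the correlation of .52 is the value reported for high tension group 3, and the standard deviation of 14.9 is an assumed reading of Table XIV.

```python
from math import sqrt


def standard_error_of_estimate(sd_predicted, r):
    """Spread of obtained scores about the scores predicted by regression."""
    return sd_predicted * sqrt(1.0 - r ** 2)


# High tension group 3: r = .52 between Semester and Final scores;
# the Final standard deviation of 14.9 is an assumed table reading.
print(round(standard_error_of_estimate(14.9, 0.52), 1))

# With no correlation the prediction is no better than the group mean,
# so the standard error of estimate equals the standard deviation.
print(standard_error_of_estimate(14.9, 0.0))
```

Under these assumptions the first value rounds to 12.7, the largest standard error of estimate in the Table. The formula also shows why a lower correlation, at a comparable standard deviation, means a poorer prediction.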

The correlation coefficients for the high
tension groups (Table XIV) decrease in magnitude
as the group selection moves toward
the upper end of the tension scale; that is, as
more of the middle group are excluded from
the high group. The coefficients for the low
tension groups increase as the selection is
made farther down the tension scale (going
from group 1 to group 2 or group 3). Comparisons
between correlations for paired high-
and low-tension groups show that the high 
tension group produces a smaller coefficient in 
each case than does the low tension group. 
In other words the high tension group pre- 
sented a lower association between Semester 
and Final scores than did the low tension 
group, and this difference was increased when 
the selection was such that the groups repre- 
sented greater divergence in tension. 

The differences between coefficients for 
paired high and low groups are not statist- 
ically significant. The greatest difference pre- 
sented is that between the number three 
groups: high group, .52; low group, .76. The 
z-values for these two coefficients are .58 and 
1.00, respectively.* The difference between
z’s is .42 and the standard error of the dif- 
ference is .33. The ratio of difference to 
standard error of difference, 1.27, is approxi- 
mately at the 20 per cent level of confidence. 
Lack of significance is of less importance than 
it might be under other circumstances because 
of consistent trends in the data. Once again, 

* Lindquist, op. cit., p. 215. 


TABLE XIV

STANDARD ERROR OF ESTIMATE AND CORRELATION COEFFICIENT IN PREDICTING FINAL FROM SEMESTER
SCORES, AND STANDARD DEVIATION ON FINAL EXAMINATION FOR HIGH AND LOW
TENSION GROUPS AND FOR TOTAL

Tension Group   Number     Tension Score   Standard Deviation   Correlation   Standard Error
                of Cases   Limits          on Final             Coefficient   of Estimate

High 3          21         37 or above     14.9                 .52           12.7
High 2          29         36 or above     15.0                 .61           11.8
High 1          39         34 or above     13.9                 .63           10.7
Low 1           38         33 or below     14.4                 [illegible]   10.2
Low 2           30         30 or below     15.9                 [illegible]   10.2
Low 3           22         28 or below     14.4                 .76           10.5
Total Group     77                         14.9                               10.6
it seems reasonable to conclude that signifi-
cance would be clearer, if a greater number
of cases at the extremes could be utilized.
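The significance test described above for the number three groups can be retraced with a short sketch. It assumes the standard Fisher z procedure (z = artanh r, standard error of the difference = sqrt(1/(N1 − 3) + 1/(N2 − 3))) and a two-sided reading of the normal curve; the study cites Lindquist for its method, so the exact computing steps are an assumption, and the function names are hypothetical. The group sizes 21 and 22 are those given in Table XIV.

```python
import math

def fisher_z(r: float) -> float:
    """Fisher's z transformation of a correlation coefficient."""
    return math.atanh(r)

def compare_correlations(r_high, n_high, r_low, n_low):
    """Return (z difference, its standard error, ratio, two-sided p)."""
    diff = abs(fisher_z(r_low) - fisher_z(r_high))
    se_diff = math.sqrt(1.0 / (n_high - 3) + 1.0 / (n_low - 3))
    ratio = diff / se_diff
    # Two-sided tail area under the normal curve (assumed reading of
    # the "20 per cent level" quoted in the text).
    p_two_sided = math.erfc(ratio / math.sqrt(2.0))
    return diff, se_diff, ratio, p_two_sided

# Number three groups: high r = .52 (21 cases), low r = .76 (22 cases)
diff, se_diff, ratio, p = compare_correlations(0.52, 21, 0.76, 22)
print(f"difference={diff:.2f}  se={se_diff:.2f}  ratio={ratio:.2f}  p={p:.2f}")
```

Carried at full precision the ratio comes out very slightly above the 1.27 obtained from the rounded figures, which does not change the conclusion.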
The standard errors of estimate in Table 
XIV reveal somewhat the same thing as do the 
correlations. The standard errors for the high 
tension groups are all greater than those for 
the low tension groups. The difference be- 
tween the standard error of estimate for high- 
tension group 1 and the standard error of 
estimate for low-tension group 1 is 0.5; in the 
group 2 comparison (farther out on the ten- 
sion scale) this difference is 1.6; in the 
group 3 comparison the difference is increased 
to 2.2. The predictions for the high tension 
group are of less value than the predictions 
for the low tension group, and as the selection 
of groups is made closer to the extremes of
Council Psychological Examination,* the
Reavis-Breslich Arithmetic Tests,** and both
Semester and Final examinations. The first 
two tests were administered to these pupils 
early in the first semester of the same year as 
the present investigation. It was decided to 
use these two scores plus the Semester exam- 
ination scores to predict the Final examina- 
tion scores, and then to examine the differ- 
ences between predicted and obtained scores 
in relation to intensity of tension. 

The intercorrelations, means, and standard
deviations for the four examinations, Amer-
ican Council, Reavis-Breslich, Semester, and
Final, are presented in Table XV. Denoting
the four variables in raw score form by capital
first letters (A, R, S, F); in standard score
form by small letters (a, r, s, f); and the


TABLE XV

INTERCORRELATIONS, MEANS, AND STANDARD DEVIATIONS FOR THE AMERICAN COUNCIL, REAVIS-
BRESLICH, SEMESTER, AND FINAL EXAMINATIONS FOR SEVENTY-FOUR CASES

the tension scale, the differences in error of 
prediction are increased, the low tension 
group always having the smaller standard 
error. 
It should be noted that, in a sense, the data 
concerning the “tension-prediction” relation- 
ship add support to the conclusions concern- 
ing the “tension-reliability” relationship which 
was discussed in the previous section. Split- 
half reliability coefficients are, in effect, a 
basis for prediction of scores on one half of 
a test from scores on the other half, or, look- 
ing at it in another way, the successfulness of 
prediction of scores on one test from those on 
another (when the materials are highly re- 
lated in content) is a measure of reliability. 
This inter-consistency for both sets of data is 
another count against attributing the differ- 
ences to chance. This is especially true since 
the groups in the reliability study were 
selected on the basis of scores on one ques- 
tionnaire, while the groups for the prediction 
study were selected on the basis of average 
scores on two questionnaires which did not 
show a high intercorrelation. 

Scores were available for seventy-four of 
the pupils in the ninth grade on the American 


Examination          Mean     Standard Deviation
American Council     96.1     23.0
Reavis-Breslich      62.2     16.2
Semester             29.9     12.1
Final                33.8     15.1


predicted variable by a bar over the letter
($\bar{F}$ or $\bar{f}$), the regression equations may be
written:

$$\bar{f} = .30a + .11r + .51s,$$

and

$$\bar{F} = .17A + .11R + .64S - 8.0.$$


The multiple correlation is .77. By means of
the raw score regression equation, an $\bar{F}$ score
(predicted score on Final Examination) was
computed for each pupil in the group of
seventy-seven for whom there were tension
scores from the questionnaire in the Final
situation. For each pupil there was also, of
course, an obtained score, F, on the Final
examination. The absolute value of the dif-
ference between predicted score and obtained
score was computed for each pupil. This
value, $|\bar{F} - F|$, represents the deviation
from prediction of the pupil’s obtained score.
It was these scores which were examined in
relation to intensity of tension on the hypoth-
esis that pupils having higher tensions at the
time of an examination produce examination
results which tend to be less predictable than
the examination results of pupils who exhibit
lower tensions.

* American Council on Education Psychological Examination,
College Freshmen, 1941 edition. Prepared by L. L.
Thurstone and Thelma Gwinn Thurstone. Washington, D. C.:
American Council on Education, 1941.

** Diagnostic Tests in the Fundamental Operations of Arith-
metic and Problem Solving, Form A (for Grades VII to XII).
By W. C. Reavis and E. R. Breslich. Chicago: University of
Chicago Press, 1927.
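The prediction step described above can be sketched as follows. The coefficients are an assumed reading of the raw score regression equation (.17, .11, .64 and the constant −8.0), the function names are hypothetical, and the pupil's scores are invented purely for illustration; none of them are data from the study.

```python
# Coefficients as read from the raw score regression equation
# (an assumed reading of a damaged line in the source).
B_AMERICAN = 0.17
B_REAVIS = 0.11
B_SEMESTER = 0.64
CONSTANT = -8.0

def predicted_final(american: float, reavis: float, semester: float) -> float:
    """Predicted Final examination score from the three earlier scores."""
    return (B_AMERICAN * american + B_REAVIS * reavis
            + B_SEMESTER * semester + CONSTANT)

def deviation_from_prediction(american, reavis, semester, final_obtained):
    """Absolute difference between predicted and obtained Final score."""
    return abs(predicted_final(american, reavis, semester) - final_obtained)

# Hypothetical pupil: A = 100, R = 60, S = 30, obtained Final score = 40
pred = predicted_final(100, 60, 30)
dev = deviation_from_prediction(100, 60, 30, 40)
print(round(pred, 1), round(dev, 1))
```

Because only the absolute value is kept, a pupil who falls five points below prediction and one who rises five points above it count as equally unpredictable.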


The pupils were divided into a high tension 
group and a low tension group on the basis 
of their scores on the questionnaire which was 
administered at the time of the Final Exam- 
ination. These groups consisted of thirty-nine 
and thirty-eight cases, respectively. Those in 
the high group had scores on the tension scale 
of 35 or above; those in the low group had 
scores on the tension scale of 34 or below. 
The mean value of $|\bar{F} - F|$ and the stand-
ard error of the mean were found for each
group. These were used to compute the differ-
ence between the means for the groups, the
standard error of the difference, and the t-
value of the difference. These statistics are
presented in Table XVI.

questionnaire, respectively. In this case, how-
ever, each group consisted of only those who
had made examination scores within a certain
narrow interval of the total examination score
range. For example: all of the pupils who
made from 60 to 64, inclusive, on the exam-
ination were divided into two groups, those
who made 35 and above on the questionnaire
and those who made 34 or below on the ques-
tionnaire; next, all of those who made from
55 to 59, inclusive, on the examination were
divided into a high tension group and a low
tension group. This process was repeated for
all intervals of 5 on the examination scores.
There were twelve such intervals, starting
with 60 to 64 and ending with 5 to 9. The
average value of $|\bar{F} - F|$ for each of these
groups is shown in Table XVII.

There are two intervals (50 to 54 and 55 
to 59) in which none of the pupils had ques- 
tionnaire scores above 34. In the remaining 
ten intervals, the high tension group shows a
higher $|\bar{F} - F|$ value than the low tension


TABLE XVI

MEAN $|\bar{F} - F|$ VALUE AND STANDARD ERROR OF THE MEAN FOR HIGH AND FOR LOW TENSION
GROUPS, AND DIFFERENCE BETWEEN MEANS, STANDARD ERROR, AND THE t-VALUE
OF THE DIFFERENCE

Tension Group    Mean $|\bar{F} - F|$    Standard Error of Mean
High             8.6                     1.04
Low              6.4                     0.77

It is evident that the high tension group 
had a higher average deviation of obtained 
score from predicted score, $|\bar{F} - F|$, than
that of the low tension group. The difference 
is significant at the 7 per cent level of confi- 
dence. In other words, when the scores on the 
Final mathematics examination were pre- 
dicted from scores on a psychological exam- 
ination, a diagnostic arithmetic examination, 
and the Semester mathematics examination, 
those pupils who showed high tensions devi- 
ated more from this predicted score than did 
those pupils who showed low tensions. 
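The comparison reported in Table XVI can be retraced from the tabled figures. The sketch assumes that the t-value is simply the difference between means divided by the tabled standard error of the difference, and that the level of confidence is a two-sided tail area; with 77 cases the normal curve is used here as an approximation to the t distribution. Function names are hypothetical.

```python
import math

def critical_ratio(mean_high: float, mean_low: float, se_diff: float) -> float:
    """Difference between the group means divided by its standard error."""
    return (mean_high - mean_low) / se_diff

def two_sided_normal_p(ratio: float) -> float:
    """Two-sided tail area under the normal curve (large-sample approximation)."""
    return math.erfc(abs(ratio) / math.sqrt(2.0))

# Table XVI: mean deviations 8.6 (high) and 6.4 (low), SE of difference 1.2
t = critical_ratio(8.6, 6.4, 1.2)
p = two_sided_normal_p(t)
print(f"t={t:.1f}  p={p:.2f}")
```

Under these assumptions the ratio and the tail area agree with the t-value of 1.8 and the 7 per cent level given in the table.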

As a further check on this “tension-predic-
tion” relationship, the average $|\bar{F} - F|$ was
computed for high tension groups and for low 
tension groups which had comparable exam- 
ination scores. The groups were selected in 
the same fashion as before: 35 or above on 
the questionnaire and 34 or below on the 


Difference       Standard Error    t-value of     Level of
Between Means    of Difference     Difference     Confidence
2.2              1.2               1.8            7%


group shows, except in the one interval of 
examination scores, 15 to 19. It should be 
noted that the larger deviation from predic- 
tion for the high tension group holds regard- 
less of whether the high tension group or the 
low tension group has the greater number of 
cases. It seems that the conclusions drawn 
from the data in Table XVI are supported by 
the direction of differences shown at all exam- 
ination score levels in Table XVII. The lack 
of “predictability” is not associated with size 
of examination score; it is associated with in- 
tensity of tension as measured by the ques- 
tionnaire. 


In summary, all of the results in the study
of the relationship between intensity of ten-
sion and predictability of examination results
point to the fact that those showing higher
tensions at the time of an examination pro-
duced examination results which tended to

TABLE XVII

AVERAGE $|\bar{F} - F|$* VALUE AND NUMBER OF
CASES FOR HIGH TENSION GROUP AND FOR LOW
TENSION GROUP WITHIN A GIVEN SCORE
INTERVAL ON THE FINAL EXAMINATION

[Table body illegible in the source. Columns: Examination Score; N and $|\bar{F} - F|$ for the High tension group; N and $|\bar{F} - F|$ for the Low tension group.]

*Absolute difference of predicted score from
obtained score.


deviate further from prediction than did the 
examination results of those who gave evi- 
dence of lower tensions, when tensions were 
measured by means of a questionnaire and 
the examinations were in ninth grade algebra. 
These results support the general hypothesis 
that pupils who undergo higher tensions than 
other pupils at the time of an examination 
tend, on the whole, to turn out a “‘less stand- 
ard” examination product. 


CONCLUSIONS AND DISCUSSIONS 


The statement of the problem had two 
parts: (1) to develop a technique by which 
the pupils taking an examination can be dif- 
ferentiated in terms of intensity of tension 
directed toward the examination; (2) to 
search for relationships between such tension 
scores and the examination results. It was 
pointed out that the technique referred to in 
(1) should be applicable to regular classroom 
examination situations. Three areas of rela- 
tionship were cited in connection with part 
(2) of the purpose of the study. 

The following conclusions are based on the 
findings of the study: 


1. The questionnaire which was developed 
for this study does afford a technique for dif- 
ferentiating pupils in terms of examination 
tensions. 

a. This technique is a practical one for 
classroom examination situations in that 
it is easy to administer, it is easy to 
score, it consumes little time in the ex-
amination period, and it may be applied
at the time and in the setting of the
examination.

b. The questionnaire results do pertain to
the particular examination situation for
which the questionnaire is used, not to
examinations in general.

c. The questionnaire which was developed
is sufficiently reliable to justify its use in
the selection of high tension and low
tension groups.

d. Certain fundamental characteristics of
validity are present in the questionnaire.

e. The questionnaire is valid in the sense
that results of the questionnaire corre-
spond to the results of a speech-motor
disturbance technique and to the results
of a respiration (rate) index of affect,
when the latter two are used for the
same purpose as the questionnaire.


2. Concomitant variation between tension 
scores and examination scores tends to be in- 
verse, but the magnitude of this relationship, 
as shown by correlation, is so small that little 
importance can be attached to the degree of 
relationship. High tensions do not necessarily 
accompany low examination scores, nor con- 
trariwise. 

3. There is a definite tendency for a “high 
tension” group, as selected by the question- 
naire, to yield a lower reliability coefficient on 
an examination than a “low tension” group 
shows on the same examination. Although the 
magnitude of the difference in reliability co- 
efficients is too small to be statistically sig- 
nificant for the size of the groups used, the 
direction of the difference shows a high con- 
sistency. 

4. Pupils showing higher tensions, as meas- 
ured by the questionnaire, at the time of an 
examination produce examination results 
which tend to deviate further from prediction 
than do the examination results of those who 
give evidence of lower tensions. 


a. This appears to be true at each score 
level within the examination. 


DISCUSSION 


It will be remembered that the question- 
naire used in this study was developed as a 
practical technique, in terms of the classroom 
examination situation, for differentiating 
pupils in regard to tensions directed toward 
an examination situation. An earlier study by 





162 JOURNAL OF EXPERIMENTAL EDUCATION 


Brown* showed that the technique had pos- 
sibilities; the present investigation has shown 
that this particular example of the technique 
meets certain criteria of practicability, reli- 
ability, and validity. It should not be consid- 
ered, however, as the method of obtaining the 
desired ends. 

If the questionnaire were to be used in fur- 
ther experimentation, certain cautions should 
be observed. The questionnaire used in the 
present work was constructed specifically for 
the group being tested; that is, essay state-
ments by this group (a representative sample)
were used as a basis of construction. If the 
technique were to be used for other groups, 
essay statements by those groups should be 
used in the construction of the schedule as a 
precaution against failure to take into account 
different forms of expression and modes of 
thinking. Undoubtedly, many of the items 
would remain the same for different groups, 
but this precaution should be observed. 

Also, it would be well to attempt to expand 
the questionnaire: (1) in order to account 
for a greater number of specific indices of 
tension, so that the entire picture for each 
pupil would be more complete; (2) so that 
the length being greater, the reliability of the 
instrument might be increased sufficiently to 
justify greater discrimination. 

One other point in connection with the use 
of the questionnaire technique bears empha- 
sis: it is especially necessary in using this type 
of technique to gain rapport with the exam- 
inees. In the present investigation every effort 
was made to obtain real cooperation on the 
part of the pupils. The successfulness of this 
effort was due largely to respect which the 
three teachers had engendered in their pupils 
previously, and to their straightforward man- 
ner of enlisting the pupils’ willing partici- 
pation. 

Except in one area, the data from the in- 
vestigation of relationships between tension 
and examination results are suggestive rather 
than conclusive. The one area is that of pre- 
dictability of examination scores. The lack of 
finality in some of the data is unfortunate, of 
course, in terms of completely solving prob- 
lems, but it is not surprising in consideration 
of scarcity of previous work in this field. 

The importance of locating and describing
these relationships is great. Examinations are
being utilized more and more as bases for cer-
tification of accomplishment, for directing
and counseling the individual in his plans for
further work, for alteration and construction
of curriculums, and for comparing the effec-
tiveness of procedures. It is important that
all aspects of the examination situation be
understood, and that aberrations in results be
interpreted properly.

* Charles H. Brown, “Emotional Reactions Before Examina-
tions: II. Results of a Questionnaire,” Journal of Psychology, 1938.

This investigation offers evidence which 
should allow for an improved control of the 
use and interpretations of examination re-
sults. Sufficient clues have been given to
justify more extensive investigation in this
area. It would be well to expand the study to
include examination material in other subject
matter fields and at other school levels. Re-
search should be undertaken to show whether
or not there is change in relative amount of 
tension at different grade levels, and, if so, 
whether or not the magnitude of the relation- 
ships (e.g., predictability variation as associ- 
ated with intensity of tension) is affected. It 
may well be that with greater tensions the 
“reliability” relationship might be of more 
apparent significance. In any event, the re- 
sults of this study indicate that the value of 
interpretation of examination results may be 
enhanced by acquiring knowledge of the inci- 
dence and magnitude of tensions. 


QUESTIONNAIRE 
Mathematics 3 


Here are some statements concerning your 
feelings about today’s test. In each state- 
ment you are to choose one of the three 
phrases marked a., b., or c. In each statement 
choose the one phrase (a., b., or c.) which 
will make the sentence most nearly describe 
your feeling. Indicate your choice by making 
a circle around the letter of that phrase. 

Example: In preparing for the test I
studied the night before the test: a. much
later than usual b. not so late as usual
c. about the same time as usual.

(The person who marked this drew a circle 
around b. He felt that he studied not so late 
as usual the night before the test.) 


1. On this test I felt that I worked . . .
a. very fast (2)*
b. just about as usual (1)
c. rather slowly (3)

* Numbers in parentheses are the weights given in scoring.
These did not appear, of course, on the questionnaires used
by the pupils.

2. While taking the test I felt . . .
a. very nervous (3)
b. somewhat nervous (2)
c. not at all nervous (1)

3. Near the end of the test I became . . .
a. somewhat rattled (2)
b. not at all rattled (1)
c. very rattled (3)

4. Yesterday I worried about this test . . .
a. not at all (1)
b. some (2)
c. a lot (3)

5. For the amount of time we had to work
on it, the test seemed to me to be . . .
a. about the right length (1)
b. much too long (3)
c. too short (2)

6. Just before we started the test I felt . . .
a. very jittery (3)
b. not at all jittery (1)
c. slightly jittery (2)

7. As soon as I began to work on the test I
felt . . .
a. very calm (1)
b. not at all calm (3)
c. fairly calm (2)

8. If this had been a non-test situation, I
feel that I would have done these same
problems . . .
a. better (3)
b. about the same (1)
c. less well (2)

9. While taking the test I had a feeling that
I was being . . .
a. very hurried (3)
b. somewhat hurried (2)
c. not at all hurried (1)

10. Right at the first of the test I was . . .
a. very nervous (3)
b. not at all nervous (1)
c. somewhat nervous (2)

11. While taking the test I felt afraid that I
was going to make a mistake. I felt this
way . . .
a. not at all (1)
b. somewhat (2)
c. very much (3)

12. When I started to take this test, I felt
that I had forgotten what I had learned
in my class work. I felt this way . . .
a. very much (3)
b. not at all (1)
c. somewhat (2)

13. While taking the test I worried about
having time to finish it. I worried about . . .
a. very much (3)
b. somewhat (2)
c. not at all (1)

14. Before the test started I felt that I knew
the material. I felt . . .
a. quite confident that I knew it (1)
b. not at all confident that I knew it (3)
c. fairly confident that I knew it (2)

15. The fact that this test is important was
in my mind . . .
a. all of the time I was taking it (3)
b. some of the time I was taking it (2)
c. none of the time I was taking it (1)

16. In comparing my feelings in connection
with this test with my feelings at the time
of other tests, on this one I felt . . .
a. more nervous or upset than on the
others (3)
b. less nervous or upset than on the
others (1)
c. about the same as I always do (2)
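A minimal sketch of how the weights printed in parentheses could be turned into a tension score is given below. It rests on several assumptions: that the questionnaire has the sixteen items listed above, that a pupil's score is the plain sum of the weights of the circled phrases, and that the cut between groups is the 35-or-above rule used for the Final examination. The names `WEIGHTS`, `tension_score`, and `tension_group` are hypothetical.

```python
# Weight of each phrase (a, b, c) for the sixteen items, copied from the
# parenthesized numbers on the questionnaire.
WEIGHTS = [
    {"a": 2, "b": 1, "c": 3},  # 1  worked fast / as usual / slowly
    {"a": 3, "b": 2, "c": 1},  # 2  nervous while taking the test
    {"a": 2, "b": 1, "c": 3},  # 3  rattled near the end
    {"a": 1, "b": 2, "c": 3},  # 4  worried yesterday
    {"a": 1, "b": 3, "c": 2},  # 5  length of the test
    {"a": 3, "b": 1, "c": 2},  # 6  jittery just before
    {"a": 1, "b": 3, "c": 2},  # 7  calm on beginning
    {"a": 3, "b": 1, "c": 2},  # 8  non-test situation
    {"a": 3, "b": 2, "c": 1},  # 9  hurried
    {"a": 3, "b": 1, "c": 2},  # 10 nervous at the first
    {"a": 1, "b": 2, "c": 3},  # 11 afraid of a mistake
    {"a": 3, "b": 1, "c": 2},  # 12 forgotten class work
    {"a": 3, "b": 2, "c": 1},  # 13 worried about time
    {"a": 1, "b": 3, "c": 2},  # 14 confidence in the material
    {"a": 3, "b": 2, "c": 1},  # 15 importance of the test in mind
    {"a": 3, "b": 1, "c": 2},  # 16 compared with other tests
]

def tension_score(responses):
    """Sum the weights of the circled phrases (assumed scoring rule)."""
    if len(responses) != len(WEIGHTS):
        raise ValueError("one response (a, b, or c) is needed per item")
    return sum(item[choice] for item, choice in zip(WEIGHTS, responses))

def tension_group(score, cut=35):
    """Classify by the 35-or-above rule used for the Final questionnaire."""
    return "high" if score >= cut else "low"

lowest = sum(min(item.values()) for item in WEIGHTS)
highest = sum(max(item.values()) for item in WEIGHTS)
all_b = tension_score(["b"] * 16)
print(lowest, highest, all_b, tension_group(all_b))
```

The phrases are deliberately not in the same order from item to item, so a pupil who circles the same letter throughout does not receive an extreme score.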


Use the rest of this paper, front or back, for
any other comments you wish to make.


THE USE OF THE METHOD OF RUNS FOR TESTING
THE RANDOMNESS OF THE ORDER
OF EXAMINATION ITEMS


TEOBALDO CASANOVA 
University of Puerto Rico 


Text-books in measurement contain the 
requirement that the position of the correct 
response in examinations of the multiple- 
choice type should be such that the elements 
of the resulting scoring key are arranged in
random order. Tossing coins or dice is often 
suggested as a means of accomplishing this 
purpose, but no method has been given for 
testing the randomness of the order. While 
several methods may be used to test compli- 
ance with this requirement, the method of 
runs is specially adapted for detecting logical 
arrangements or systematizations, obtained 
from purposeful design or from careless atti- 
tude toward this point, which depart sig- 
nificantly from the expected value of the 
distribution. 

Take, for example, the scoring key of a
15-item 5-choice test:

2, 2, 2, 3, 4, 1, 5, 4, 4, 1, 3, 3, 3, 3, 2.

The arrangement is composed of one run
of length 3 of the second choice, one run of
length 1 of each one of the third, fourth,
first, and fifth choices, one run of length 2 of
the fourth choice, one run of length 1 of the
first choice, one run of 4 of the third choice,
and finally, one run of 1 of the second choice.
If $r_{ij}$ stands for runs of choice i and of length
j, the arrangement may be more briefly de-
scribed by stating the frequencies of the $r_{ij}$'s.
Thus,

$$r_{11} = 2,\ r_{21} = 1,\ r_{23} = 1,\ r_{31} = 1,\ r_{34} = 1,\ r_{41} = 1,\ r_{42} = 1,\ r_{51} = 1,$$

where the remaining $r_{ij}$'s equal zero, and

$$\sum_i \sum_j j\, r_{ij} = n,$$

n being the number of items in the test.
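The tabulation of runs described above is easy to mechanize. The sketch below uses Python's `itertools.groupby` to split a scoring key into runs and counts how many runs there are of each choice and length; the function name `run_frequencies` is hypothetical, and the key is the 15-item example as read from the description of its runs.

```python
from collections import Counter
from itertools import groupby

def run_frequencies(key):
    """Return a Counter mapping (choice, run length) to number of runs."""
    return Counter(
        (choice, len(list(group))) for choice, group in groupby(key)
    )

key = [2, 2, 2, 3, 4, 1, 5, 4, 4, 1, 3, 3, 3, 3, 2]
r = run_frequencies(key)

for (choice, length), count in sorted(r.items()):
    print(f"r[{choice},{length}] = {count}")

# Each run of length j accounts for j items, so the weighted sum
# of the frequencies must equal the number of items.
items_accounted_for = sum(length * count for (_, length), count in r.items())
```

The final sum is a convenient check that no item of the key has been missed or counted twice.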

The distribution of runs has been studied 
recently by several investigators and has been 
found to become sensibly normal as the value
of n increases. A. M. Mood* has given for-
mulas for the first and second moments of the


* Mood, A. M. “The Distribution Theory of Runs.” The 
Annals of Mathematical Statistics, 1940, 11, 367-392. 


distribution, which are adapted here in some- 
what abridged notation, simplified, tabled, 
and applied. 

Suppose that it is desired to test the scor-
ing key of a 100-item true-false examination.
The number of different scoring keys that can
be made is** $2^{100} = 126{,}765 \times 10^{25}$.

The total number of runs, $r_i$, of either the
true or the false choice ranges from 50 runs of
length 1 to 1 run of length 100, with the
mean, $M_{r_i}$, at

$$M_{r_i} = npq + p^2 \tag{1}$$

where p is the reciprocal of the number of
choices in the test, and p + q = 1. Thus,

$$M_{r_i} = 100 \times \tfrac{1}{2} \times \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^2 = 25.25$$

The variance of the runs of all lengths of
either choice is given by

$$\sigma^2_{r_i} = np(1 - 4p + 6p^2 - 3p^3) + p^2(3 - 8p + 5p^2) \tag{2}$$

which for a 100-item true-false test is
$\sigma^2_{r_i} = 6.31$, and $\sigma_{r_i} = 2.51$.

Here the null hypothesis is that the posi-
tion of the correct choice is arranged in
random order, that is, that the scoring key
is a random arrangement. If the one per cent
point is adopted as the limit beyond which
the null hypothesis is disproved, and since the
distribution of runs is normal, it is inferred
from the normal probability tables that scor-
ing keys with total number of runs of all
lengths of either choice beyond the limits
$+2.33\sigma$ and $-2.33\sigma$ are not random arrange-
ments. For a mean of 25.25 and a $\sigma$ of 2.51,
these limits are 31.10 and 19.40. This means
that scoring keys with total number of runs
of 32 or higher and with total number of runs
of 19 or lower are not random arrangements.
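The computation just described can be sketched in a few lines. Formula (2) is taken here as $np(1 - 4p + 6p^2 - 3p^3) + p^2(3 - 8p + 5p^2)$, a reading that reproduces the values 6.31 and 8.38 quoted in the text; the function names and the default multiplier 2.33 are illustrative choices, not part of the original paper.

```python
import math

def runs_mean(n: int, k: int) -> float:
    """Formula (1): mean number of runs of all lengths of one choice."""
    p = 1.0 / k
    q = 1.0 - p
    return n * p * q + p * p

def runs_variance(n: int, k: int) -> float:
    """Formula (2), as read here: variance of that number of runs."""
    p = 1.0 / k
    return (n * p * (1 - 4 * p + 6 * p**2 - 3 * p**3)
            + p**2 * (3 - 8 * p + 5 * p**2))

def runs_limits(n: int, k: int, z: float = 2.33):
    """Lower and upper limits, mean -/+ z standard deviations."""
    mean = runs_mean(n, k)
    sd = math.sqrt(runs_variance(n, k))
    return mean - z * sd, mean + z * sd

lower, upper = runs_limits(100, 2)
print(f"mean={runs_mean(100, 2):.2f}  variance={runs_variance(100, 2):.2f}")
print(f"limits: {lower:.2f} to {upper:.2f}")
```

A scoring key would then be questioned when its observed number of runs of either choice falls outside the two limits.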


** The exact number is 1,267,650,600,228,229,401,496,703,205,376.

This test is readily sensitive to systematiza-
tions resulting from an effort to avoid the use
of runs of 2 or longer, which may lead to a
total number of runs greater than 31. On the
other hand, the tendency to use exceedingly
long runs in one choice may lead to a total
number of runs lower than 20 in both choices.

Substituting in (1) and (2) for a 100-item
5-choice examination, $M_{r_i} = 100 \times 1/5 \times
4/5 + (1/5)^2 = 16.04$ and

$$\sigma^2_{r_i} = 8.38$$
Randomness may also be tested through
runs of any determinate length, j. The gen-
eral formulas are:

$$M_{r_{ij}} = p^j q\,[(n - j - 1)q + 2] \tag{3}$$

$$\sigma^2_{r_{ij}} = p^{2j} q^2 \left[(n - 2j)^2 q^2 + (n - 2j)\,q\,(1 + 5p) + 6p^2\right] + M_{r_{ij}} - M_{r_{ij}}^2 \tag{4}$$

For the 100-item 5-choice test, and for runs
of length 2,

$$M_{r_{i2}} = \left(\tfrac{1}{5}\right)^2 \times \tfrac{4}{5}\left[(100 - 2 - 1)\tfrac{4}{5} + 2\right] = 2.547$$

$$\sigma^2_{r_{i2}} = \left(\tfrac{1}{5}\right)^4 \left(\tfrac{4}{5}\right)^2 \left[96^2 \left(\tfrac{4}{5}\right)^2 + 96\left(\tfrac{4}{5}\right)\left(1 + \tfrac{5}{5}\right) + \tfrac{6}{25}\right] + 2.547 - 2.547^2 = 2.25$$

which shows that there should not be more
than 6 runs of length 2 in any one of the 5
choices. The average number of runs of
lengths 1, 2, 3, 4, and 5, in a 100-item true-
false test, obtained by substituting in (3),
are respectively 12.75, 6.31, 3.13, 1.55, and
.77. Making the necessary adjustment so that

$$\sum_j j\, r_{ij} = \frac{n}{2},$$

it could be said, roughly speaking, that the
ideal scoring key for such a test should con-
tain 50 true and 50 false items, and 13, 6, 4,
2, and 1 runs of lengths 1 to 5 respectively,
a total of 26 runs for each choice. Since the
number of arrangements of 26 things, of
which 13 are of one kind, 6 of another, etc.,
is given by the coefficient of $a^{13} b^6 c^4 d^2 e$ in
the expansion of $(a + b + c + d + e)^{26}$, or
the multinomial coefficient

$$\frac{26!}{13!\,6!\,4!\,2!\,1!},$$

the number of ways in which this ideal scor-
ing key can be obtained is

$$2 \times \frac{26!}{13!\,6!\,4!\,2!} \times \frac{26!}{13!\,6!\,4!\,2!}.$$

The factor 2 is due to the fact that there are
2 possible arrangements for each combination,
one forward and one backward, when the
number of runs in both choices are equal.
This expression represents a 13-digit whole
number, and it shows that the number is so
large, that other requirements for the order
of the items such as difficulty and subject
matter can be easily met even when one de-
cides to follow this ideal combination, or a
similar one.
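Formulas (3) and (4) lend themselves to a short sketch. Formula (4) is read here as $p^{2j}q^2[(n-2j)^2q^2 + (n-2j)q(1+5p) + 6p^2] + M - M^2$, a reading chosen because it reproduces the worked value for the 5-choice test and the tabled entries quoted later in the paper; the function names are hypothetical.

```python
def mean_runs_of_length(n: int, k: int, j: int) -> float:
    """Formula (3): mean number of runs of exactly length j of one choice."""
    p = 1.0 / k
    q = 1.0 - p
    return p**j * q * ((n - j - 1) * q + 2)

def variance_runs_of_length(n: int, k: int, j: int) -> float:
    """Formula (4), as read here: variance of that number of runs."""
    p = 1.0 / k
    q = 1.0 - p
    m = mean_runs_of_length(n, k, j)
    bracket = (n - 2 * j) ** 2 * q**2 + (n - 2 * j) * q * (1 + 5 * p) + 6 * p**2
    return p ** (2 * j) * q**2 * bracket + m - m * m

# Expected numbers of runs of lengths 1 to 5 in a 100-item true-false test
expected = [mean_runs_of_length(100, 2, j) for j in range(1, 6)]
print([round(x, 2) for x in expected])

# Runs of length 2 in a 100-item 5-choice test
m2 = mean_runs_of_length(100, 5, 2)
v2 = variance_runs_of_length(100, 5, 2)
print(round(m2, 3), round(v2, 2))
```

Carrying the mean at full precision gives a variance for the 5-choice case that rounds to 2.26 rather than 2.25, the small difference the paper itself attributes to rounding when it compares the two figures.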


Some texts suggest that the correct response
be evenly distributed among the choices. If
it is agreed that a 100-item true-false test is
to consist of 50 true and 50 false items, the
number of different scoring keys is reduced to

$$\frac{100!}{50!\,50!}$$

This represents a 30-digit number, and shows
that the 50-50 rule is no serious inconveni-
ence in the work of the test builder. Here the
proportion of each choice in the scoring key
is not the result of random sampling, but it
is fixed beforehand. If $n_i$ is the number fixed
for the ith choice, the average number of
runs of all lengths for that choice is

$$M_{r_i} = \frac{n_i\,(n - n_i + 1)}{n} \tag{5}$$

and

$$\sigma^2_{r_i} = \frac{n_i^{(2)}\,(n - n_i + 1)^{(2)}}{n\, n^{(2)}} \tag{6}$$

where the notation $n^{(x)}$ has the usual meaning
$n^{(x)} = n(n - 1)(n - 2) \cdots$ to x factors.

For the 100-item true-false test,

$$M_{r_i} = \frac{50\,(100 - 50 + 1)}{100} = 25.50$$

$$\sigma^2_{r_i} = \frac{50 \times 49 \times 51 \times 50}{100 \times 100 \times 99} = 6.31$$


against 25.25 and 6.31 obtained through
formulas (1) and (2). Formulas (5) and
(6) are applicable to items with any number
of choices. For a 2-choice test, such as our
example of a 100-item true-false test, the
mean and variance of runs of length j for
either choice are

$$M_{r_{ij}} = \frac{(n - n_i + 1)^{(2)}\, n_i^{(j)}}{n^{(j+1)}} \tag{7}$$

$$\sigma^2_{r_{ij}} = \frac{(n - n_i)^{(2)}\,(n - n_i + 1)^{(2)}\, n_i^{(2j)}}{n^{(2j+2)}} + M_{r_{ij}} - M_{r_{ij}}^2 \tag{8}$$

Substituting in these, for runs of length 2,

$$M_{r_{i2}} = \frac{51 \times 50 \times 50 \times 49}{100 \times 99 \times 98} = 6.44$$

$$\sigma^2_{r_{i2}} = \frac{50 \times 49 \times 51 \times 50 \times 50 \times 49 \times 48 \times 47}{100 \times 99 \times 98 \times 97 \times 96 \times 95} + 6.44 - 6.44^2 = 4.37$$

against 6.31 and 5.14 obtained through (3)
and (4).

It will be noticed that these results are
quite similar to those obtained from formulas
(1) to (4), in which the value of $n_i$ was not
fixed but was supposed to be randomly drawn
from the population. The latter are quite
general and may be used for values of $n_i$
other than 50. For example, they may be
used to test a scoring key of 60 true and 40
false items, if these numbers had been fixed
beforehand; but if they arose in a random
fashion, formulas (1) to (4) more properly
apply. These are simpler and are likely to be
the most useful because randomization is
more complete when the $n_i$ values are the
result of random sampling.

Formulas (1) and (3) are simple enough,
but (2) and (4), specially (4), require con-
siderable arithmetical work for large values
of j. They may be easily tabled by substi-
tuting for the number of choices and the
length of the run. This is done in Tables 1
and 2. In both of them K stands for the
number of choices, and the entries must be
divided by some power of K. For example,
the variance of the number of runs of all
lengths for each choice in a 100-item 5-choice
test is

$$\sigma^2_{r_i} = \frac{52 \times 100 + 40}{5^4} = \frac{5240}{625} = 8.38$$

which is the same value obtained above by
substituting in (2). While the entries of
Table 1 must be divided by $K^4$ in order to
obtain the variance of the runs of all lengths,
those of Table 2 must be divided by $K^{2j+4}$
to obtain the variance of runs of length j.
Thus, the variance of the number of runs of
length 2 in each choice of a 100-item 5-choice
test is, from Table 2,

$$\sigma^2_{r_{i2}} = \frac{8848 \times 100 - 3432}{5^8} = \frac{881368}{390625} = 2.26$$

this value being slightly more accurate than
that of 2.25 obtained above by substituting
in Formula (4). Again, from Table 2, the
variance of the number of runs of length 5
in a 100-item 5-choice test is [value illegible in the source].
Since the powers of K are given in Barlow’s
Tables and elsewhere, this is a simple opera-
tion. The amount of computational labor
saved through the use of Table 2 may be
appreciated by calculating this last value by
substituting in Formula (4).

TABLE 1

VARIANCE OF THE NUMBER OF RUNS OF ALL
LENGTHS IN A TEST OF K CHOICES AS
GIVEN BY FORMULA (2)

K     $\sigma^2_{r_i} \times K^4$
2     n + 1
3     6n + 8
4     21n + 21
5     52n + 40

TABLE 2

VARIANCE OF THE NUMBER OF RUNS OF LENGTH j IN A TEST OF K CHOICES
AS GIVEN BY FORMULA (4)*

j     K = 2          K = 3          K = 4                 K = 5
1     7n + 8         76n + 52       387n + 240            1360n + 776
2     13n + 9        [illegible]    [illegible]           8848n - 3432
3     27n            [illegible]    [illegible]           [illegible]
4     57n - 51       [illegible]    [illegible]           [illegible]
5     119n - 224     [illegible]    146619n - 486700      1247812n - 4358840

* The entry is $\sigma^2_{r_{ij}} \times K^{2j+4}$.
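The fixed-proportion formulas (5), (6), and (7) can be checked with exact arithmetic. The sketch below reads $n^{(x)}$ as the falling factorial $n(n-1)\cdots$ to x factors and takes the formulas in the form that reproduces the worked values 25.50, 6.31, and 6.44; the function names are hypothetical, and formula (8) is left out of the sketch.

```python
from fractions import Fraction

def falling(n: int, x: int) -> int:
    """n^(x) = n (n - 1) (n - 2) ... to x factors."""
    result = 1
    for t in range(x):
        result *= n - t
    return result

def mean_runs_fixed(n: int, n_i: int) -> Fraction:
    """Formula (5): mean number of runs of all lengths of choice i."""
    return Fraction(n_i * (n - n_i + 1), n)

def variance_runs_fixed(n: int, n_i: int) -> Fraction:
    """Formula (6): variance of the number of runs of all lengths."""
    return Fraction(falling(n_i, 2) * falling(n - n_i + 1, 2),
                    n * falling(n, 2))

def mean_runs_of_length_fixed(n: int, n_i: int, j: int) -> Fraction:
    """Formula (7), two-choice case: mean number of runs of length j."""
    return Fraction(falling(n - n_i + 1, 2) * falling(n_i, j),
                    falling(n, j + 1))

m_all = mean_runs_fixed(100, 50)
v_all = variance_runs_fixed(100, 50)
m_len2 = mean_runs_of_length_fixed(100, 50, 2)
print(float(m_all), round(float(v_all), 2), round(float(m_len2), 2))
```

Using `Fraction` keeps the large products exact until the final rounding, which is the point at which hand computation of these expressions is most likely to drift.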

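Finally, the mean and variance for the
total number of runs of all lengths are

$$M_r = \frac{n}{K}(K - 1) \tag{9}$$

$$\sigma^2_r = \frac{M_r}{K} \tag{10}$$

For the 100-item true-false test,

$$M_r = \frac{100}{2}(2 - 1) = 50.00$$

$$\sigma^2_r = \frac{50.00}{2} = 25.00$$

For the 5-choice test,

$$M_r = \frac{100}{5}(5 - 1) = 80.00$$

$$\sigma^2_r = \frac{80.00}{5} = 16.00$$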
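As a hedged check on formulas (9) and (10), read here as $M_r = (n/K)(K-1)$ and $\sigma^2_r = M_r/K$, the sketch below computes both and compares the total mean with K times the per-choice mean from formula (1). Function names are hypothetical.

```python
def total_runs_mean(n: int, k: int) -> float:
    """Formula (9): approximate mean of the total number of runs."""
    return n * (k - 1) / k

def total_runs_variance(n: int, k: int) -> float:
    """Formula (10): approximate variance of the total number of runs."""
    return total_runs_mean(n, k) / k

def per_choice_mean(n: int, k: int) -> float:
    """Formula (1): mean number of runs of a single choice."""
    p = 1.0 / k
    return n * p * (1.0 - p) + p * p

for k in (2, 5):
    total = total_runs_mean(100, k)
    summed = k * per_choice_mean(100, k)
    print(k, total, total_runs_variance(100, k), round(summed, 2))
```

The sum of the per-choice means exceeds the approximate total only slightly (50.5 against 50 and 80.2 against 80), which is consistent with the formulas being described as approximate.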


These approximate formulas may serve to
check the results obtained for each choice.
Thus, $M_{r_i}$ was 25.25 for each choice, com-
pared to a value of 50.00 given by (9) for
both choices. For the 5-choice test, $M_{r_i}$ was
16.04 for each choice, compared to 80.00 for
the five choices. However, in order to check
the variances, the intercorrelations must be
taken into account, and these are not dis-
cussed in this paper. Formulas for the co-
variances are given in the work of Mood cited
above.


THE MEASUREMENT OF RANDOMNESS IN TEST ITEMS*


TEOBALDO CASANOVA 
University of Puerto Rico 


1. THE MEANING OF RANDOMNESS 


The failure of the Literary Digest’s Poll to 
predict the Presidential elections of 1936 after 
asking millions of voters for their voting in- 
tentions, and the very close forecast of the 
same election made by Fortune Magazine’s
Quarterly Survey of Public Opinion with a
small sample of 4,000 voters have served to
further stimulate research that was already
under way in the representativeness of samples.
Methods of avoiding bias and thus
selecting truly representative samples in 
human populations have received the greatest 
attention, as excellent methods of testing rep- 
resentativeness were already available. More 
recent is the development of methods of test- 
ing randomness in series of observations, 
measurements, or series of digits or numbers 
in the field of quality control of mass produc- 
tion in industry, in other fields of economics, 
and in biology. 


For the purpose of measuring one or more 
attributes of an entire population through the 
measurement of the same attributes in a sam- 
ple of that population, the sample is said to 
be representative if the measurements taken 
on the sample are equivalent to those taken 
on the entire population, except for the fact 
that the measurements on the sample are sub- 
ject to an error that decreases as the size of 
the sample increases, and which is known as 
the probable error. However, if the sample is 
not representative, that is, if it is not a ran- 
dom sample of the population, the probable 
error falls short of indicating the actual error 
because there are systematic errors. So that 
there is no use in applying probable error 
formulas in the presence of much larger sys- 
tematic errors whose size is seldom known. 
This requires the selection of samples that are 
truly representative of the parent population, 
and some quite elaborate methods for so 
doing, in use at the present time, are stratifi- 
cation, double sampling, subsampling and 
subdivision. 

* This article was written after the other article in this number had been accepted for publication.


In series of observations, digits, or num- 
bers, the problem has not been, generally 
speaking, that of selecting a series which is
representative of the entire distribution, but 
that of testing for randomness a series ob- 
tained from economical or biological data. 
One or more observed series of magnitudes as 
represented by numbers or single digits are 
either elements or samples from the entire
random distribution, and the tests of random- 
ness are similar to the tests of representative- 
ness in human populations in that the tests 
of significance and the null hypothesis to be 
tested are precisely the same. Here the total 
population of which the observed series is an 
element, is made up of all the possible com- 
binations, arrangements, or permutations, 
that can be formed from the digits in the 
single series, under the specified conditions; 
and if the statistics of the observed series and 
those of the entire distribution of all possible 
arrangements differ significantly, the null 
hypothesis is rejected. As the digits are in 
random order in the total distribution, tests 
of randomness are of essentially the same 
nature as tests of representativeness. More- 
over, in the same manner that representative- 
ness of sample as to one trait furnishes no 
assurance of representativeness as to another 
trait, whether or not the latter trait is similar 
to the former, randomness as exhibited by any 
particular sequence of digits does not insure 
randomness as determined by any other sim- 
ilar or dissimilar sequence, as far as it is 
known at the present time. 


One often quoted definition of randomness 
is that given by von Mises (11): (a) “The 
relative frequencies of particular attributes of 
single elements of the collective tend to fixed 
limits. (b) The fixed limits are not affected 
by any place selection.” Take for example 
the scoring key of a 100-item true-false test. 
If the proportion of true items tends to a 
fixed limit, say .50, (a) is satisfied. In this 
case the whole scoring key is the collective. 
Then (b) means that if the first 20 items, 
that is, items 1 to 20, are selected, the pro- 
portion of true items approaches .50, and
that the limit of this proportion is not changed
if instead of taking items 1 to 20, now items 
2 to 21, 50 to 69, or any other 20 consecutive 
items are selected. There would be 81 such 
series of consecutive items, the first one in- 
cluding items 1 to 20, and the last one items 
81 to 100. Of course, the proportion of true
items would vary from series to series, from 
which the “between variance” may be ob- 
tained. After obtaining the variance within 
the series, the null hypothesis that the 81 
samples belong to the same population can be 
tested through analysis of variance. This 
would be a test of randomness following von 
Mises’ definition, but it would be a too time- 
consuming procedure for any practical pur- 
pose, and it is not known whether it would 
constitute an absolute test of randomness, so 
that it has not been used as such. 
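
The place selection described above, taking every run of 20 consecutive items from a 100-item key, may be sketched as follows. The scoring key here is hypothetical, drawn with an arbitrary seed purely for illustration, and the sketch stops at the 81 proportions; it does not carry out the analysis of variance.

```python
import random


def window_proportions(key, width=20):
    """Proportion of true items (coded 1) in every run of `width` consecutive items."""
    return [
        sum(key[start:start + width]) / width
        for start in range(len(key) - width + 1)
    ]


random.seed(1944)  # arbitrary seed, used only to make the illustration repeatable
hypothetical_key = [random.randint(0, 1) for _ in range(100)]  # 1 = true, 0 = false

proportions = window_proportions(hypothetical_key)
print(len(proportions))                    # number of 20-item series
print(min(proportions), max(proportions))  # spread of the proportion from series to series
```

A 100-item key yields 100 - 20 + 1 = 81 such series, as stated in the text; because neighbouring series share 19 items, they are far from independent samples.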


Up to the present time randomness has 
been tested by comparing the frequency of 
any attribute in the given arrangements with 
the frequency in the total distribution of all 
possible arrangements. The attribute is gen- 
erally a single digit or number, or a sequence 
of two or more consecutive digits forming a 
well defined pattern. Most of the methods 
use a finite population with N arrangements 
or permutations, although Mood (12), among 
a few others, has adapted the formulas to 
asymptotic distributions. In dealing with 
short series or small sets of numbers, the 
entire random distribution may be quickly 
written down in full length. For example, in 
testing the randomness of several arrange- 
ments of the four digits 1, 2, 3, 4, no repeti- 
tions being allowed, the total number of different
permutations, N, is 4! = 24. The 24
permutations may be easily written down and 
the number of sequences selected for the test 
may be simply counted, and the average or 
expected value and the variance calculated. 
Now, if the particular sequences are known 
to be normally distributed, or approximately 
so, the usual test of significance may be 
applied to the frequencies in the given or ob- 
served arrangements; or if the discrepancies 
between the observed and the expected se- 
quences are known to be distributed as chi 
square, it is in order to apply the chi square 
test with the proper number of degrees of 
freedom. In testing longer series the moments 
of the total random distribution are obtained, 
as a general rule, by the elementary algebra 
of combinations and permutations. Such combinations
or permutations lead to binomial
or multinomial distributions, and these have 
been found by Hoel (7) to approach the chi- 
square distribution. 
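
For a short series the whole random distribution can indeed be written out and counted. The sketch below does so for the four digits 1, 2, 3, 4; the attribute counted (neighbouring digits differing by exactly one) is chosen only for illustration and is not prescribed by the text.

```python
from itertools import permutations


def adjacent_unit_steps(arrangement):
    """Count neighbouring pairs whose digits differ by exactly one."""
    return sum(1 for a, b in zip(arrangement, arrangement[1:]) if abs(a - b) == 1)


# All N = 4! = 24 arrangements of the digits 1, 2, 3, 4, no repetitions allowed.
arrangements = list(permutations([1, 2, 3, 4]))
counts = [adjacent_unit_steps(p) for p in arrangements]

mean = sum(counts) / len(counts)
variance = sum((c - mean) ** 2 for c in counts) / len(counts)
print(len(arrangements), mean, variance)
```

The mean and variance obtained in this way are the expected value and variance against which the count in an observed arrangement is compared.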

The need for testing the randomness of the 
order of test items arises from the fact that 
the implicit assumption of this randomness 
has been underlying our measurement techniques.
Guessing corrections in common use
have been aimed at correcting for pure chance
guessing. The well known correction for
guessing in multiple choice tests, and those made
by Cureton and Dunlap (4), Conrad (3),
Zubin (21) and others for correcting for
guessing in matching and rearrangement or
continuity tests, are devices of this sort. Several
empirical studies have shown that many
times the reliability coefficient of corrected 
scores is not the same as that of the uncor- 
rected scores, and the writer (1) has offered 
theoretical proof of this fact. The purpose of 
guessing corrections has been to separate 
accomplishment from pure chance guessing, 
but besides accomplishment and guessing 
there is a third factor that may sometimes 
affect the size of a score, and whose effect has 
been neither accounted for nor eliminated 
from scores. If present, it is that part of a 
score that is due to the discovery of systema- 
tizations, regularly occurring designs, or 
periodic sequences that may exist in the en- 
tire response pattern or scoring key. Obvi- 
ously, this need not always be a perfectly con- 
scious or voluntary act, but it may occasion- 
ally happen that one unintentionally falls 
into the rhythm while thinking about the 
other details of the test. While there are no 
means for correcting for this bias, it may be 
eliminated through suitable tests of random- 
ness. 
It is hardly necessary to say that the
grosser kind of systematization may be easily
detected by mere inspection. Take, for instance,
one of an extreme kind. Suppose a
10-response matching test in which the digits
0 to 9 are to be used, and in which repetitions
are allowed. Nobody would think of writing
an item so that each one of the ten choices
on the right-hand side matches the second
choice on the left. The scoring key would
show a run of length ten of digit two, like
those described in (2). Since the number of
different permutations of ten digits, allowing
repetitions, is $10^{10}$, and since this particular
permutation has a frequency of only one, the
probability of its occurrence in one permutation
is .0000000001. Therefore, the arrangement
is not supposed to have been drawn in
the course of random sampling, and any test
of randomness will reject it, as the null
hypothesis of randomness is decidedly disproved
with probabilities of .01 or less. If
the probability is .05 or greater, the null
hypothesis is not disproved. However, systematizations
of a more subtle kind require
the use of tests of randomness, and the detection
of complex and extended designs might
even demand the application of very powerful
tests. Suppose that instead of one item we
are now testing twelve 10-choice matching
items, or the scoring key of fifty 5-choice
items. Here it is usually necessary to apply
one or more tests of randomness.

Since thus far randomness has been tested 
by comparing the frequency of any attribute
in the observed arrangement with that of the 
random distribution, and as the number of 
attributes in a series of digits is infinite, no 
combination of any number of tests can con- 
stitute an absolute test of randomness, and it 
is not known whether such a test can ever be 
made by any other method. Different sorts 
of systematizations are sensitive in various 
degrees to different kinds of tests, and the par- 
ticular test to be tried depends on the sort of 
systematization, pattern, or design that one 
is trying to detect, or whose presence one sus- 
pects. At this point it is important to know 
that it is the randomness of the order of the 
digits that is being tested here, and not the 
randomness of the sequence selected for the
test. The latter is only a criterion for com- 
parison, in the same manner as the trait on 
which the representativeness of a human pop- 
ulation is tested. In psychological measure- 
ment it is not known what kind of systematizations
generally creep into the scoring keys
of tests, and the whole matter must wait for 
future research. Perhaps these methods may 
later find their way into other fields of experi- 
mental and applied psychology such as learn- 
ing or practice curves, or any other series of 
quantitative or qualitative data, rank correla- 
tion, or test reliability. 

If some ready-made series of random num- 
bers are needed, they may be selected from 
Tippett’s Random Sampling Numbers (16). 
It contains 41,600 digits, most of which were 
taken at random from the census in England. 
They are printed in 26 pages, 1600 digits to
the page. Each page is made up of 32 columns
and 50 rows, the columns being assembled in
groups of four. These series have passed the
many tests of randomness that have been
applied to them.

The tests that follow have been taken from 
the sources cited, except the first one which 
has been used for some time, and those in 
Sections 3, 4, 5, and 6, which are proposed by 
the writer, and which will serve as an intro- 
duction to the general method. In those of 
other authors, the formulas are derived here 
only when the derivations are not offered in 
the original sources, or when the derivations 
offered here are much simpler. The methods 
are explained in terms of digits for the sake 
of simplicity, but they may be extended to
numbers of any size. 


2. FREQUENCIES OF DIGITS


Simple and well known methods for testing
the frequencies of digits in scoring keys where
these are used to designate the various choices
are already in common use. In a 50-item true-false
test the probability for either the true
or the false choice is .50, as both are equally
likely, and the expected number of true responses
is 25. Suppose that there are 19 true
items in the test. Then the problem is to determine
whether this number differs significantly
from the expected number, or 25. $\sigma_f$,
the standard deviation of a frequency, is given
by the formula

$$\sigma_f = \sqrt{pqN} \tag{1}$$

where $p$ is the probability, in this case .50,
$q = 1 - p$, and $N$ is the number of items.
Substituting in it, $\sigma_f = 3.54$, and

$$\frac{25 - 19}{3.54} = 1.70$$

The normal curve tables show that the probability
of obtaining a deviation equal to or
larger than 1.70 from the mean by pure
chance is .09. This does not quite disprove
the null hypothesis that the scoring key was
obtained in the course of random sampling,
as a probability P of .05 or less is required.
This P of .05, or the five per cent point, is
the fiducial or confidence limit, although the
stricter requirement of a one per cent point is
often demanded. If in the same 50-item true-false
test the scoring key has 11 true items,
proceeding as before

$$\frac{25 - 11}{3.54} = 4.0$$

P = .00006, the null hypothesis is readily disproved,
and the scoring key is not such a
combination as may be obtained in a random
manner.

In a 100-item 5-choice test, or in a 20-item
5-response matching test where repetitions are
allowed, that is, where in any item each one
of the digits may appear any number of times
between zero and five, the probability of each
digit is $\frac{1}{5}$, or .20, and the expected frequency
is this number multiplied by the number of
responses, or 100 × .20 = 20. If the observed
frequencies are as follows,


Observed 
Frequency 


the chi-square test may be applied in the
usual manner by substituting in the formula

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} \tag{2}$$

where $f_e$ is the expected frequency of each
digit and $f_o$ the observed frequency. Here
$\chi^2 = 4.70$. Applying the chi-square test with
four degrees of freedom because there are five
classes and only one constraint, the sum of
the frequencies which must equal 100, Fisher’s
tables give a P of slightly over .30, which
means that the null hypothesis is not disproved,
and that the observed combination
may be regarded as having been drawn at random
from the total number of possible combinations.
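
Both tests of this section may be sketched in a few lines. The function names are illustrative. The five observed frequencies below are hypothetical, chosen only so that they total 100 and yield a chi-square of 4.70; they are not the article's data. The tail probability for chi-square uses the closed form that holds for four degrees of freedom only.

```python
import math


def frequency_ratio(observed, n_items, p):
    """Deviation from the expected frequency in units of sigma_f, formula (1)."""
    sigma_f = math.sqrt(p * (1 - p) * n_items)
    return abs(observed - p * n_items) / sigma_f


def two_tailed_normal_p(ratio):
    """Probability of a deviation at least this large in either direction."""
    return math.erfc(ratio / math.sqrt(2))


def chi_square(observed, expected):
    """Formula (2): sum of (f_o - f_e)^2 / f_e over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_p_four_df(value):
    """Upper-tail probability of chi-square, valid for four degrees of freedom only."""
    return math.exp(-value / 2) * (1 + value / 2)


ratio = frequency_ratio(19, 50, 0.5)
print(round(ratio, 2), round(two_tailed_normal_p(ratio), 2))

hypothetical_observed = [25, 14, 24, 16, 21]  # assumed frequencies, total 100
value = chi_square(hypothetical_observed, [20] * 5)
print(round(value, 2), round(chi_square_p_four_df(value), 2))
```

For the key with 11 true items the unrounded ratio is a little under 4.0, so the probability computed this way comes out somewhat larger than the .00006 read from the tables at 4.0; the conclusion is unchanged.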


3. DIFFERENCES BETWEEN CONSECUTIVE
DIGITS


Matrix M shows the $n^2$ permutations of $n$
digits in pairs, allowing repetitions. It may be
observed that all the pairs in which the difference
$d$ equals zero are in the principal
diagonal, the one running from the upper
left-hand corner to the lower right-hand corner,
all those with $d = 1$ are in the two adjacent
diagonals, and so on, and that $N_d$, the
number of differences of size $d$, is equal to
$2(n - d)$ for $0 < d < n$, and to $n$ for $d = 0$.
In the set of 5 digits 32749, $d = 1$ for the
first two consecutive digits 32, $d = 5$ for the
next pair 27, $d = 3$ for the next pair 74, and
$d = 5$ for the last pair. There are $(k - 1)$
such pairs or sequences in a set of $k$ digits
selected from $n$ ($k \le n$). Whenever two consecutive
places are filled by any pair, $(k - 2)$
places are left to be filled by the remaining
$(n - 2)$ digits, which can be done in
$(n - 2)^{(k-2)}$* different ways, and therefore
the number of consecutive digits with a difference
equal to $d$ in sets of $k$ digits selected
from $n$ digits is $2(n - d)(k - 1)(n - 2)^{(k-2)}$,
if repetitions are not allowed. Since the total
number of different permutations is $n^{(k)}$, the
expected number in one permutation, that
is, the average number of differences of size
$d$ per permutation, $M_d$, repetitions not being
allowed, is

$$M_d = \frac{2(n - d)(k - 1)(n - 2)^{(k-2)}}{n^{(k)}} = \frac{2(n - d)(k - 1)}{n(n - 1)} \tag{3}$$

for $k \le n$ and $0 < d < n$. In the case of
5-choice matching or rearrangement items,
$n = 5$ and $k = 5$. Substituting in (3), the
expected number of differences of 1, 2, 3, and
4 in one item are,

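$$\begin{aligned}
M_{d_1} &= \frac{2 \times 4 \times 4}{5 \times 4} = 1.6 \\
M_{d_2} &= \frac{2 \times 3 \times 4}{5 \times 4} = 1.2 \\
M_{d_3} &= \frac{2 \times 2 \times 4}{5 \times 4} = 0.8 \\
M_{d_4} &= \frac{2 \times 1 \times 4}{5 \times 4} = 0.4
\end{aligned}$$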

If there are 15 such items, each one of the 
expected values must be multiplied by 15. 


Doing so, the expected number of differences
of sizes 1, 2, 3, and 4 in fifteen 5-choice
matching or rearrangement items are respectively
24, 18, 12, and 6. The observed frequencies
are obtained by tabulating them
under the headings $d_1$, $d_2$, $d_3$, and $d_4$ from
the scoring key of the test, which is the pattern
whose order is being tested for randomness.
Then chi-square may be applied, as in
section 2, with three degrees of freedom. It
must be remembered that if there are frequencies
less than five at any one of the extremes,
these must be combined with those
preceding them, and that this procedure reduces
the number of classes, and consequently
the number of degrees of freedom.

* The notation $n^{(k)}$ means factorial $n$ to $k$ factors, i.e., $9^{(3)} = 9 \times 8 \times 7$.
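
Formula (3) can be checked against a direct count over every permutation. The sketch below uses the reduced form $2(n-d)(k-1)/n(n-1)$, which follows because $n^{(k)} = n(n-1)(n-2)^{(k-2)}$; the function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def expected_differences(n, k, d):
    """Formula (3), reduced: 2(n - d)(k - 1) / (n(n - 1)), for 0 < d < n, no repetitions."""
    return Fraction(2 * (n - d) * (k - 1), n * (n - 1))


def enumerated_differences(n, k, d):
    """Average number of neighbouring pairs differing by d over all k-permutations of 1..n."""
    total = 0
    count = 0
    for arrangement in permutations(range(1, n + 1), k):
        total += sum(1 for a, b in zip(arrangement, arrangement[1:]) if abs(a - b) == d)
        count += 1
    return Fraction(total, count)


for d in range(1, 5):
    per_item = expected_differences(5, 5, d)
    assert per_item == enumerated_differences(5, 5, d)
    print(d, float(per_item), float(15 * per_item))  # per item, and for fifteen items
```

For fifteen 5-choice items the loop reproduces the expected frequencies 24, 18, 12, and 6 given above.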


Formula (3) may also be used in matching
tests where the number of choices in the left-hand
side is greater than the number of
choices in the right-hand side, where the responses
are recorded. In this case the former
number equals $n$, and the latter equals $k$. For
example, if there are ten choices in the left-hand
column and six in the right-hand or
response column, $n = 10$ and $k = 6$, and the
expected number of differences of size one per
item, $M_{d_1}$, is, from (3),

$$M_{d_1} = \frac{2 \times 9 \times 5}{10 \times 9} = 1.0$$


But if the contrary is true, that is, if the
number in the response column is the larger
of the two, then repetition of digits becomes
necessary and formula (3) may not be used.
If repetitions are allowed, the number of differences
does not change except for the fact
that there may be differences of zero when
there are two consecutive digits of the same
kind. The number of zero differences was
seen to be $n$ in Matrix M. The number of
places to be filled is still $(k - 2)$, but they
may now be filled by all the $n$ available digits
instead of only by the remaining $(n - 2)$,
and this can be done in $n^{k-2}$ different ways,
since repetitions are allowed, so that the number
of consecutive digits with difference equal
to $d$ in sets of $k$ digits is $2(n - d)(k - 1)n^{k-2}$,
and as the total number of different
permutations is $n^k$, the average number of
differences of size $d$ per permutation, $M_{d'}$, if
repetitions are allowed, is

$$M_{d'} = \frac{2(n - d)(k - 1)n^{k-2}}{n^k} = \frac{2(n - d)(k - 1)}{n^2} \tag{4}$$

for integral values of $k$ not zero and for
$0 < d < n$. For $d = 0$,

$$M_{d'_0} = \frac{n(k - 1)}{n^2} = \frac{k - 1}{n} \tag{4a}$$


These zero differences are not the same as 
the runs described in the writer’s other article 
in this same number (2). There the arrange- 
ment 22222 contains only one run of length 
five of digit 2, while here it is made up of 
four sequences of zero difference. 

Formulas (4) and (4a) may be applied in
the same manner as (3) to matching items
when the number of choices in the response
column is larger than that in the other column.
Besides, their use may be extended to
multiple choice items. In a 50-item 5-choice
test, $n = 5$ and $k = 50$, that is, the scoring
key is a permutation or set of 50 digits
selected from 5 digits, of course allowing
repetitions. Substituting in (4) and (4a),

$$\begin{aligned}
M_{d'_0} &= 49 \div 5 = 9.80 \\
M_{d'_1} &= 2 \times 4 \times 49 \div 25 = 15.68 \\
M_{d'_2} &= 2 \times 3 \times 49 \div 25 = 11.76 \\
M_{d'_3} &= 2 \times 2 \times 49 \div 25 = 7.84 \\
M_{d'_4} &= 2 \times 1 \times 49 \div 25 = 3.92 \\
\text{Total} &= 49.00
\end{aligned}$$

Observe that the total is always $(k - 1)$, or
the number of 2-digit sequences, here 49. The
chi-square test may be applied after the observed
frequencies are tabulated, but here the
last frequency is to be combined with the
next to the last one, and the number of degrees
of freedom is three.
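
The expected values under formulas (4) and (4a) may be computed, and checked by enumeration on a small case, as in the following sketch. The function names are illustrative, and the small case ($n = 3$, $k = 4$) is chosen only to keep the enumeration short.

```python
from fractions import Fraction
from itertools import product


def expected_differences_rep(n, k, d):
    """Formulas (4) and (4a): expected differences of size d, repetitions allowed."""
    if d == 0:
        return Fraction(k - 1, n)                       # (4a)
    return Fraction(2 * (n - d) * (k - 1), n * n)        # (4)


def enumerated_differences_rep(n, k, d):
    """Average count of neighbouring pairs differing by d over all n^k arrangements."""
    total = 0
    for arrangement in product(range(1, n + 1), repeat=k):
        total += sum(1 for a, b in zip(arrangement, arrangement[1:]) if abs(a - b) == d)
    return Fraction(total, n ** k)


values = [expected_differences_rep(5, 50, d) for d in range(5)]
print([float(v) for v in values], float(sum(values)))

# Small check by enumeration: n = 3, k = 4 gives 81 arrangements.
for d in range(3):
    assert expected_differences_rep(3, 4, d) == enumerated_differences_rep(3, 4, d)
```

The five values for the 50-item 5-choice key total 49, the number of 2-digit sequences, as the text observes.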


4. SUMS OF CONSECUTIVE DIGITS


By this method, in the arrangement 34215
there are four sums: 3 + 4 = 7, 4 + 2 = 6,
2 + 1 = 3, and 1 + 5 = 6. If the diagonals
in Matrix M are now drawn in the opposite
direction, so that the longest diagonal runs
from the upper right-hand corner to the lower
left-hand corner, it will be seen that all the
pairs in any diagonal have equal sums. The
number of pairs in each diagonal diminishes
in proportion to their distance from the
middle one, but the sums of the pairs of digits
diminish from a maximum of $2n$ at the
lower right-hand corner to a sum of two at
the upper left-hand corner (the first row and
the first column of the matrix are eliminated
in order to avoid the digit zero). By induction,
the frequency of any pair with sum $s$ is
$(s - 1 - 2a)$, where $a = s - n - 1$ if
$s > (n + 1)$, and $a = 0$ if $s \le (n + 1)$. As in
the preceding section, each pair, or sequence
of two consecutive digits, may occupy $(k - 1)$
positions in a set of $k$ numbers, and at all
times there are $(k - 2)$ vacant places to be
filled by the $(n - 2)$ remaining digits, when
sets of $k$ digits are selected from $n$ digits.
These places may be filled in $(n - 2)^{(k-2)}$
different ways, so that the number of consecutive
digits with sum $s$ is
$(s - 1 - 2a)(k - 1)(n - 2)^{(k-2)}$ and the expected
number or average per set or permutation,
$M_s$, is

$$M_s = \frac{(s - 1 - 2a)(k - 1)(n - 2)^{(k-2)}}{n^{(k)}} = \frac{(s - 1 - 2a)(k - 1)}{n(n - 1)} \tag{5}$$

where $a = s - 1 - n$ if $s > (n + 1)$, and
$a = 0$ if $s \le (n + 1)$. The expected value for
each item of a 5-choice matching or rearrangement
test is found by substituting $n = 5$ and
$k = 5$ in formula (5). Doing so,

$$\begin{array}{lll}
M_{s_2} = .20 & M_{s_5} = .80 & M_{s_8} = .60 \\
M_{s_3} = .40 & M_{s_6} = 1.00 & M_{s_9} = .40 \\
M_{s_4} = .60 & M_{s_7} = .80 & M_{s_{10}} = .20
\end{array}$$

Each of these is to be multiplied by the number
of items in the test in order to get the
expected value for the whole test or subtest.
If the number of items is small it may be
necessary to group these nine classes into
four or five.

If repetitions are allowed, as before, the
$(k - 2)$ places may now be filled by all of
the $n$ digits in $n^{k-2}$ different ways, and the
frequency is now $(s - 1 - 2a)(k - 1)n^{k-2}$,
from which $M_{s'}$, the average sum of size $s$
per permutation, is thus obtained:

$$M_{s'} = \frac{(s - 1 - 2a)(k - 1)n^{k-2}}{n^k} = \frac{(s - 1 - 2a)(k - 1)}{n^2} \quad (k > 0) \tag{6}$$

where $a = (s - n - 1)$ if $s > (n + 1)$ and
$a = 0$ if $s \le (n + 1)$. This formula may be
applied to 40 items of a 5-choice test. Substituting
in it the values $n = 5$ and $k = 40$,

$$\begin{array}{lll}
M_{s'_2} = 1.56 & M_{s'_5} = 6.24 & M_{s'_8} = 4.68 \\
M_{s'_3} = 3.12 & M_{s'_6} = 7.80 & M_{s'_9} = 3.12 \\
M_{s'_4} = 4.68 & M_{s'_7} = 6.24 & M_{s'_{10}} = 1.56
\end{array}$$

Here it will be necessary to group the first
three frequencies into one class, and proceed
likewise with the last three classes, thus reducing
the nine classes to five, and the number
of degrees of freedom to four. It is naturally
assumed that the size of the observed
frequencies will be similar to these. It is
worth noting that formulas (4) and (6) do
not require that the expected values be multiplied
by the number of items, because it is
the entire scoring key as a permutation that
is being tested, while the expected values
given by (3) and (5) are for each item of a
matching or rearrangement test.
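
Formula (6), for sums with repetitions allowed, may be sketched as follows, together with a direct count of the pairs in the $n \times n$ matrix that give each sum. Function names are illustrative; the grouping into five classes follows the instruction in the text.

```python
from fractions import Fraction


def pair_count(n, s):
    """Number of ordered pairs of digits 1..n with sum s: s - 1 - 2a."""
    a = s - n - 1 if s > n + 1 else 0
    return s - 1 - 2 * a


def expected_sums_rep(n, k, s):
    """Formula (6): expected number of neighbouring pairs with sum s, repetitions allowed."""
    return Fraction(pair_count(n, s) * (k - 1), n * n)


n, k = 5, 40
# The closed form for the pair count agrees with counting the matrix directly.
for s in range(2, 2 * n + 1):
    direct = sum(1 for a in range(1, n + 1) for b in range(1, n + 1) if a + b == s)
    assert pair_count(n, s) == direct

values = [expected_sums_rep(n, k, s) for s in range(2, 2 * n + 1)]
print([float(v) for v in values])

# Group the first three and the last three classes, leaving five classes.
grouped = [sum(values[:3])] + values[3:6] + [sum(values[6:])]
print([float(v) for v in grouped])
```

The nine expected values total $k - 1 = 39$, and after grouping no class falls below five.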


5. NUMBER OF CONSECUTIVE DIGITS


The length of a sequence of consecutive
digits is the number of digits in it, i.e., 23, 65,
and 87 are of length two, while 2345 and 9876
are of length four. In the arrangement
13297654, there are eight sequences of length
one; four sequences of length two, 32, 76, 65,
and 54; two sequences of length three, 765
and 654; and one sequence of length four,
7654. The reader may have noticed that here
as well as in sections 3 and 4, the sequences
are dependent on each other, and for certain
purposes, such as the estimation of the probability
of occurrence of two or more of them,
the intercorrelations are necessary.


By inspection of the series 12345 . . . , it
is observed that the number of sequences of
consecutive digits of length two that can be
formed from $n$ digits is $2(n - 1)$, in which
the direct, and also the reverse sequences 21,
32, 43, and 54 are included. It is equally
apparent that the number of sequences of
length $h$ that can be formed from $n$ digits is
$2(n - h + 1)$. In sets of $k$ digits selected
from $n$ digits, where $k \le n$, any sequence of
length two may occupy $(k - 1)$ different
positions, and in general, any sequence of
length $h$ may occupy $(k - h + 1)$ different
positions, and there will always be $(k - h)$
places to be filled by the remaining $(n - h)$
digits, which can be done in $(n - h)^{(k-h)}$
different ways, no repetitions being allowed.
So that the frequency is
$2(n - h + 1)(k - h + 1)(n - h)^{(k-h)}$
and the average number of sequences of consecutive
digits of length $h$, per permutation,
in sets of $k$ digits selected from $n$ digits, no
repetitions being allowed, is

$$M_h = \frac{2(n - h + 1)(k - h + 1)(n - h)^{(k-h)}}{n^{(k)}} = \frac{2(n - h + 1)(k - h + 1)}{n^{(h)}} \tag{7}$$

where $0 < h \le k \le n$, $n^{(0)} = n^0 = 1$, and
$0^{(0)} = 1$; when $h = 1$, the number 2 in the
numerator is omitted. If $n = k$, (7) becomes

$$M_h = \frac{2(n - h + 1)^2}{n^{(h)}} \tag{7a}$$

This test gives small frequencies for values of
$h$ greater than two. For $n = 5$, $k = 5$,
$M_{h_1} = 5$, $M_{h_2} = 1.6$, $M_{h_3} = .3$, $M_{h_4} = .07$,
and $M_{h_5} = .02$. After multiplying these by
20, for a 20-item test, only the first three give
values greater than five, and the first one,
$M_{h_1}$, may not be used because it is constant
and equal to $n$. Two classes remain which can
be tested by the chi square method with one
degree of freedom, or by the use of formula
(1).
By the same reasoning used in the derivation
of formulas (4) and (6), the expected
number of sequences of length $h$ in one
arrangement or permutation, when repetitions
are allowed, is

$$M_{h'} = \frac{2(n - h + 1)(k - h + 1)}{n^h} \tag{8}$$

where $0 < h \le k$, and $k <$, $=$, or $> n$. This formula
also produces small frequencies. For example,
for $n = 5$ and $k = 20$, $M_{h'_2} = 6.08$ and
$M_{h'_3} = .86$, the others being still smaller.
The first one may be tested through formula
(1).
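
The expected numbers of consecutive-digit sequences under formulas (7) and (8) may be sketched as below, with formula (7) checked by enumeration for $n = k = 5$. Function names are illustrative. Dropping the factor 2 when $h = 1$ is stated in the text for (7); applying the same rule to (8) is an assumption made here.

```python
from fractions import Fraction
from itertools import permutations
from math import perm


def expected_sequences(n, k, h):
    """Formula (7), reduced: no repetitions; the factor 2 is omitted when h = 1."""
    factor = 1 if h == 1 else 2
    return Fraction(factor * (n - h + 1) * (k - h + 1), perm(n, h))


def expected_sequences_rep(n, k, h):
    """Formula (8): repetitions allowed (factor rule for h = 1 assumed as in (7))."""
    factor = 1 if h == 1 else 2
    return Fraction(factor * (n - h + 1) * (k - h + 1), n ** h)


def is_consecutive(window):
    """True if the digits rise by one at every step, or fall by one at every step."""
    steps = {b - a for a, b in zip(window, window[1:])}
    return steps == {1} or steps == {-1}


def enumerated_sequences(n, k, h):
    """Average number of length-h consecutive sequences (h >= 2) over all k-permutations."""
    arrangements = list(permutations(range(1, n + 1), k))
    total = sum(
        1
        for p in arrangements
        for start in range(k - h + 1)
        if is_consecutive(p[start:start + h])
    )
    return Fraction(total, len(arrangements))


print([round(float(expected_sequences(5, 5, h)), 2) for h in range(1, 6)])
print([round(float(expected_sequences_rep(5, 20, h)), 2) for h in (2, 3)])
```

With $n = k = 5$ the first line reproduces the values 5, 1.6, .3, .07, and .02 quoted in the text, and the second the values 6.08 and .86.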

6. ODD-EVEN SEQUENCES


Sets of $k$ digits may be formed from $n$
digits, of which $n_o$ are odd and $n_e$ are even,
no repetitions being allowed, so that $k \le n$.
Let the four possible arrangements be designated
as follows:


oo = odd after odd
oe = even after odd
eo = odd after even
ee = even after even


In the permutation 31245, the sequence 31 is
oo, 12 is oe, 24 is ee, and 45 is eo. As in the
previous sections, there are $(k - 1)$ positions
for each sequence, $(k - 2)$ vacant places to
be filled by the remaining $(n - 2)$ digits, and
this can be done in $(n - 2)^{(k-2)}$ different
ways. The number of oo sequences or pairs is
$n_o^{(2)}$, that of ee is $n_e^{(2)}$, and that of both oe
and eo equals $n_o \times n_e$. The total number of
$k$-permutations from $n$ digits is $n^{(k)}$, so that
the expected values per permutation are,

$$\begin{aligned}
M_{oo} &= \frac{(k - 1)\, n_o^{(2)}}{n^{(2)}} \\
M_{oe} &= \frac{(k - 1)\, n_o n_e}{n^{(2)}} \\
M_{eo} &= \frac{(k - 1)\, n_e n_o}{n^{(2)}} \\
M_{ee} &= \frac{(k - 1)\, n_e^{(2)}}{n^{(2)}}
\end{aligned} \tag{9}$$

Here if $n$ is odd, $n_o = n_e + 1$; otherwise,
$n_o = n_e$. By substituting in (9), values expected
in one permutation for $n = k$ are given
below.


Sequence 


These are good values for applying the chi
square test with three degrees of freedom.
They must be multiplied by the number of
items. Again, proceeding as in the previous
sections, the expected values, when repetitions
are allowed, are as follows:

$$\begin{aligned}
M_{oo'} &= \frac{(k - 1)\, n_o^2}{n^2} \\
M_{oe'} &= \frac{(k - 1)\, n_o n_e}{n^2} \\
M_{eo'} &= \frac{(k - 1)\, n_e n_o}{n^2} \\
M_{ee'} &= \frac{(k - 1)\, n_e^2}{n^2}
\end{aligned} \qquad k <, =, \text{ or } > n \tag{10}$$

This formula may be applied to matching
tests with items having a number of choices
in the response column greater than that in
the other column, as repetitions then become
necessary. Let $k$ be the number of choices in
the response column, and $n$ the number of
choices in the other column. For $n = 8$,
$n_o = 4$, $n_e = 4$, and $k = 4$, each one of the
four expected values equals .75. If there are
eight or more items, after multiplying by the
number of items the frequencies will be large
enough for the application of the chi square
test. As these sequences are correlated with
each other, the probability of obtaining two
or more of them in the same permutation may
not be had without taking into account the
intercorrelations.
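
The expected odd-even frequencies of formulas (9) and (10) may be sketched together. The function name and the `repetitions` flag are illustrative, and the digits are assumed to run from 1 to $n$, so that an odd $n$ gives one more odd digit than even.

```python
from fractions import Fraction


def odd_even_expected(n, k, repetitions):
    """Expected oo, oe, eo, ee frequencies per permutation, formulas (9) and (10)."""
    n_odd = (n + 1) // 2   # digits 1..n assumed
    n_even = n // 2
    if repetitions:        # formula (10): denominators n^2
        pairs = {"oo": n_odd ** 2, "oe": n_odd * n_even,
                 "eo": n_even * n_odd, "ee": n_even ** 2}
        total_pairs = n * n
    else:                  # formula (9): factorial powers n^(2) = n(n - 1)
        pairs = {"oo": n_odd * (n_odd - 1), "oe": n_odd * n_even,
                 "eo": n_even * n_odd, "ee": n_even * (n_even - 1)}
        total_pairs = n * (n - 1)
    return {name: Fraction((k - 1) * count, total_pairs) for name, count in pairs.items()}


with_repetitions = odd_even_expected(8, 4, repetitions=True)
print({name: float(value) for name, value in with_repetitions.items()})

without_repetitions = odd_even_expected(5, 5, repetitions=False)
print({name: float(value) for name, value in without_repetitions.items()})
```

In either case the four expected values total $k - 1$, the number of 2-digit sequences in the set.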


7. UPS AND DOWNS


This test, as well as the three following
ones, is offered by E. L. Dodd (6). When a
series of digits is split into pairs, each pair is
classified as either up or down according to
whether the second digit is larger or smaller
than the first. When they are equal, the
“monotone increasing” sequences 55, 66, 77,
88, and 99 are included in up while the “monotone
decreasing” sequences 00, 11, 22, 33,
and 44 are included in down. For example,
59 and 66 are up, but 72 and 33 are down.

Let a series of eight digits be considered as
four pairs. Thus, the series 38458617 is made
up of the four pairs 38, 45, 86, and 17. This
arrangement is designated as up up down up.
Allowing repetitions, 16 such arrangements
are possible. Matrix M shows that the numbers
of up and down sequences are exactly
equal. Therefore, the probability of drawing
either an up or a down at random is .50, and
the probability for each one of the 16 arrangements
made up of four pairs is $.5^4$ or .0625.
This is a small probability for the usual type
of test, and unless there are about 100 sets of
eight digits each, the frequencies are too small
for the application of the chi square test.
Such eight-digit sets may be found in matching
tests, but there are always fewer than 100
items. Perhaps the test may be used in dealing
with longer series of psychological data,
which can be divided into eight-digit sets.
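
The classification of pairs may be sketched as follows; the function names are illustrative. The final lines confirm by direct count that, among the 100 possible pairs of digits 0 to 9, ups and downs are equally frequent.

```python
def classify_pair(first, second):
    """Classify a pair of digits as 'up' or 'down' by the rule described above."""
    if second > first:
        return "up"
    if second < first:
        return "down"
    # Equal digits: 55 to 99 count as up, 00 to 44 as down.
    return "up" if first >= 5 else "down"


def ups_and_downs(series):
    """Split a string of digits into non-overlapping pairs and classify each pair."""
    digits = [int(ch) for ch in series]
    return [classify_pair(digits[i], digits[i + 1]) for i in range(0, len(digits) - 1, 2)]


print(ups_and_downs("38458617"))

all_pairs = [classify_pair(a, b) for a in range(10) for b in range(10)]
print(all_pairs.count("up"), all_pairs.count("down"))
```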


8. MAXIMA AND MINIMA 


This is another of Dodd’s (6) methods. It
is an extension of the preceding one, splitting
the series into sets of three instead of into
pairs. Let the sequence $abc$ be called “monotone
increasing” if $a \le b \le c$, with $a < c$; or if
$a = b = c \ge 5$. Let it be called “monotone
decreasing” if $a \ge b \ge c$, with $a > c$; or if
$a = b = c < 5$. If $abc$ is neither monotone increasing
nor monotone decreasing, then $b$ is either
a maximum or a minimum, with $a < b > c$ for
a maximum and $a > b < c$ for a minimum.
Designate the classes as follows:


Monotone increasing as up
Monotone decreasing as down
b a maximum as Max
b a minimum as Min


The frequencies of these classes may be studied in Matrix M. All the pairs in
the last row and in the last column contain the digit nine. Since this is the
largest digit, it is not possible to form a maximum by putting another
different digit between the two that make up any of these pairs. Suppose that
the last row and the last column of this $n \times n$ matrix are eliminated so
that an $(n-1) \times (n-1)$ matrix is left. It will be observed now that at
least one different digit may be put between the two that make up any of the
$(n-1)^2$ pairs in the remaining matrix in order to make a maximum. Eliminate
again the last row and the last column so that an $(n-2) \times (n-2)$ matrix
remains. Now at least two digits may be added to the remaining $(n-2)^2$ pairs
in order to make maxima. It should be obvious by this time that the number of
maxima that may be formed from n digits is
$(n-1)^2 + (n-2)^2 + (n-3)^2 + \cdots + 1$. For the ten digits from zero to 9
this is 0 + 1 + 4 + ... + 81 = 285. It may be seen that the number of minima
is exactly the same. This may be checked by beginning with the first row and
the first column, eliminating them, and proceeding likewise until the last. As
the ups and downs are equally likely, the number of either is 500 - 285 = 215,
and since there are 1000 possible sets of three digits, the probabilities are
.285 for a maximum or a minimum and .215 for an up or a down.
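
The class frequencies can be checked by direct enumeration. The following is a minimal sketch, not part of Dodd's method; the function name `classify` is illustrative, and the tie rule for three equal digits is read as "5 to 9 count as up, 0 to 4 as down".

```python
from itertools import product

def classify(a, b, c):
    """Classify a three-digit set (digits 0-9) as up, down, Max or Min."""
    if a < b > c:
        return "Max"
    if a > b < c:
        return "Min"
    if a == b == c:
        # Assumed tie rule: three equal digits 5-9 are up, 0-4 are down.
        return "up" if a >= 5 else "down"
    # Any remaining set is monotone with unequal end digits.
    return "up" if a < c else "down"

counts = {"up": 0, "down": 0, "Max": 0, "Min": 0}
for a, b, c in product(range(10), repeat=3):
    counts[classify(a, b, c)] += 1
print(counts)
```

Dividing each count by 1000 gives the class probabilities used in the chi square test.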

This test may be applied to six-choice 
multiple choice, matching, or rearrangement 
tests by separating the scoring key into sets 
of three digits. The chi square test with three 
degrees of freedom may be applied after mul- 
tiplying these probabilities by the number of 
3-digit sets, and tabulating the four classes 
from the scoring key. 

If a series of six digits is considered as two sets of three, then 16 classes
can be formed, such as:

    034 756      752 354      662 788
    up  Min      down Max     down up

As these sets are independent of each other, the probability of each one of
the 16 pairs is found by multiplying together the probabilities of both sets.
For example,

    p of up Min  = .215 × .285 = .061
    p of Max Max = .285^2      = .081
    p of up down = .215^2      = .046

As these probabilities are smaller, this test is better adapted to long series
of digits which can be divided into pairs of sets of 3 digits.

9. RANGE TEST


The problem is to find out the number of sets with k digits and range R that
can be formed from n digits, repetitions being allowed, the only requirement
being that $0 < R < n$. The range is, as usual, the difference between the
largest and the smallest number in the set. The problem is easily solved by
determining first the number of sets with a range equal to $(n - 1)$, which,
in turn, can be had without difficulty by first finding out the number of sets
that have a range smaller than $(n - 1)$. Let then k = 4, n = 7 (the 7 digits
from 0 to 6), and since the number of sets with $R = (n - 1)$ is required
first, let R = 6. The number of sets with R < 6 is calculated in Table 1.


TABLE 1

NUMBER OF SETS OF FOUR DIGITS SELECTED FROM SEVEN, WITH A RANGE LESS THAN SIX

    - - - -     5^4                     =  625
    0 - - -     5^3 × 4 × 2             = 1000
    0 0 - -     5^2 × (4×3)/(1×2) × 2   =  300
    0 0 0 -     5 × 4 × 2               =   40
    0 0 0 0     1 × 2                   =    2
    Total                                 1967

The total number of sets of four digits that can be made from the seven
numbers, repetitions allowed, is $7^4 = 2401$. Of these, those that contain
neither the digit 0 nor the digit 6, plus those that contain one of these
digits but not the other, have a range smaller than six. In Table 1 the 0
stands for the digit 0, and the dash stands for the places to be filled by any
of the other five digits 1, 2, 3, 4 and 5. The first line shows the different
ways in which four places may be filled by five digits, or $5^4$. The second
line of the table contains one 0 and three places to be filled by the five
digits, which can be done in $5^3$ ways. The 0 can occupy four different
positions. As the same number of permutations is obtained when the digit 6 is
used instead of the digit 0, the total number of permutations of this type
that have a range of less than six is $5^3 \times 4 \times 2$. The third line
contains two 0's, and two places to be filled by five digits, which can be
done in $5^2$ ways. The fraction $\frac{4 \times 3}{1 \times 2}$ represents
the number of different ways in which four places can be filled by two things
that are alike. The numbers in the remaining lines are obtained in a similar
manner. Except for the factor 2, which does not occur in the first line, the
numbers in the table are those in the formal expansion of $(5 + 1)^4$, and
therefore the sum of the terms is $2 \times 6^4 - 5^4$. As this is the number
of sets with a range of less than six, the number with a range equal to six,
$N_R$, is found by subtracting it from the total number of sets, and

$$N_R = 7^4 - 2 \times 6^4 + 5^4$$

and in general, for $0 < R < n$,

$$N_R = (R + 1)^k - 2R^k + (R - 1)^k \qquad (11)$$

This formula was given by Dodd (6), but the derivation was not offered.
Substituting in it R = 6 and k = 4, $N_R = 434$. It is well to note that the
number of sets with a range of six that can be obtained from the digits from 0
to 6 is the same as the number that can be obtained from the digits from 2 to
8, from 3 to 9, or from any other two numbers whose difference is six. This
makes it possible to calculate the number of sets with a range of six when the
largest available digit is greater than 6. Thus if c is the number of
available digits, with c > n, $C_R = N_R (c - n + 1)$, and since
$R = n - 1$,

$$C_R = N_R (c - R) \qquad (11a)$$

From this, the number of four-digit sets with a range of six that can be
obtained from the ten digits from 0 to 9 is 434 (10 - 6) = 1736. As the total
number of sets is $10^4$ or 10,000, the probability of getting a four-digit
set with a range of six is .1736.
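
Formulas (11) and (11a) can be compared with a brute-force count. This is a hedged sketch only; the function names `n_range` and `count_range` are illustrative and do not come from Dodd.

```python
from itertools import product

def n_range(R, k):
    """Formula (11): k-digit sets drawn from R + 1 consecutive digits
    whose range is exactly R (repetitions allowed)."""
    return (R + 1) ** k - 2 * R ** k + (R - 1) ** k

def count_range(c, k, R):
    """Brute-force count of k-digit sets from the digits 0..c-1 with range R."""
    return sum(1 for s in product(range(c), repeat=k) if max(s) - min(s) == R)

print(n_range(6, 4))                 # sets from the digits 0-6 with range 6
print(n_range(6, 4) * (10 - 6))      # formula (11a) for the ten digits 0-9
print(count_range(10, 4, 6))         # direct enumeration for comparison
```

The enumeration grows as $c^k$, so it is only practical for small sets; the closed form has no such limit.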

This test may be applied to n-choice matching tests when repetitions are
allowed. It may also be used for the scoring key of a multiple choice test
after it has been divided into small sets. If n = 6, k = 6, R = 5,

$$N_R = 6^6 - 2 \times 5^6 + 4^6 = 19502$$

As the total number of sets is $6^6 = 46,656$, the probability of obtaining a
set with a range of five in sets of six digits selected from 6 digits is
19502 ÷ 46,656 = .418; and the expected number in 20 such sets or items is
.418 × 20 = 8.36. If n = 4, k = 6, R = 3,

$$N_R = 4^6 - 2 \times 3^6 + 2^6 = 2702$$

and by (11a), $C_R = N_R (c - R)$ with c = 6,

$$C_R = 2702 (6 - 3) = 8106$$

This is the number of sets with a range of three in sets of six digits
obtained from six digits. As the total number of sets is still 46,656, the
probability of obtaining a set with a range of three is .174. This requires 30
items to yield frequencies that are large enough to be tested with chi square.


10. REPLICATION OR POKER TEST 


This test is given by Dodd (6) as a modification of the poker test made by
M. G. Kendall and B. B. Smith (8) in England, and it consists in determining
the probability of obtaining in small sets of numbers just one pair or double,
or just one triple, one quadruple, one quintuple, one sextuple, one septuple,
one octuple, etc., or any combination thereof. As applied to small sets of
digits, what is wanted is the probability of getting just two of a kind, for
example, no matter what the position occupied by these two in the set. This
method as well as that described in section 3 must be distinguished from the
method of runs described in (2). Here the set 534124 contains one pair of
digits 4, but it contains no runs because these digits do not occupy
consecutive places.

Take the case of sets of 5 digits obtained from 5 digits, that is, k = 5,
n = 5. The frequency of sets containing just one pair is desired. Naturally,
no repetitions can occur in the other three digits because if they did occur,
there would be more than one pair, or a triple, or one pair and a triple, or a
quadruple, or a quintuple. Then when there are three single digits and one
pair, there are altogether four different kinds of digits in the set. The
number of different ways of combining four different digits selected from five
is $\binom{5}{4} = 5$. An example of such a combination is 42312. Since in
each combination each one of the four digits can be the double and the three
remaining the single digits, there is a fourfold increase in the number of
combinations. Now, the number of different ways in which five things, of which
two are alike, can be arranged is $\frac{5!}{2!}$. Therefore, the frequency of
sets with just one pair is

$$\binom{5}{4} \times 4 \times \frac{5!}{2!} = 1200,$$

and since there are $5^5 = 3125$ possible sets, allowing repetitions, the
probability of obtaining just one pair in one set is 1200 ÷ 3125 = .384. In
the case of one pair and one triple, there are only two different kinds of
digits in each set, and the number of combinations is $\binom{5}{2}$. This
number is to be multiplied by two because for any one combination such as
24224, in which there is a double of fours and a triple of twos, there is the
other combination* 42442. The number of different arrangements for five things
of which two are alike and three others are alike is $\frac{5!}{2!\,3!}$, and
therefore, the frequency of sets with one pair and one triple is

$$\binom{5}{2} \times 2 \times \frac{5!}{2!\,3!} = 200.$$

The others are obtained in like manner, as shown in Table 2.

*This is not another permutation because it is not a matter of change of
position only.


TABLE 2
PROBABILITY OF THE SEVERAL REPLICATIONS IN FIVE-DIGIT SETS

    Combination                  Frequency    Probability
    No repetition                    120         .0384
    One pair                        1200         .3840
    Two pairs                        900         .2880
    One triple                       600         .1920
    One pair and one triple          200         .0640
    One quadruple                    100         .0320
    One quintuple                      5         .0016
    Total                           3125        1.0000

These probabilities are large enough to apply the chi square test in the case
of 20 items, repetitions being allowed.

Dodd gives general formulas for all cases, 
but the procedure outlined here is so much 
simpler that the formulas are omitted. 
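
The frequencies for five-digit sets drawn from five digits can also be obtained by enumerating all $5^5$ sets. The sketch below is illustrative only; the helper name `replication_pattern` and the tuple encoding of the classes are assumptions made here, not Dodd's notation.

```python
from collections import Counter
from itertools import product

def replication_pattern(digits):
    """Sorted multiplicities of the distinct digits,
    e.g. (1, 1, 1, 2) for exactly one pair, (2, 3) for a pair and a triple."""
    return tuple(sorted(Counter(digits).values()))

# Tabulate every five-digit set drawn from five digits, repetitions allowed.
freq = Counter(replication_pattern(s) for s in product(range(5), repeat=5))
total = 5 ** 5
for pattern, f in sorted(freq.items()):
    print(pattern, f, round(f / total, 4))
```

The same loop handles other set sizes by changing the two arguments of `product`, at the cost of an enumeration that grows as $n^k$.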


11. INVERSIONS 


An inversion occurs in a series of digits whenever one digit is preceded by a
larger digit. Thus, in the series 45132 there are seven inversions: 3 before
2, 5 before 2, and 4 before 2; 5 before 3, and 4 before 3; 5 before 1, and 4
before 1. Table 3 illustrates the several positions that the inversion 1,0 may
occupy in a set of four digits, or 3 + 2 + 1 = 6.

TABLE 3

NUMBER OF POSITIONS THAT ONE INVERSION MAY OCCUPY IN A SERIES

    1 0 - -     1 - 0 -     1 - - 0        3
    - 1 0 -     - 1 - 0                    2
    - - 1 0                                1

For any value of n, the number is equal to $\frac{1}{2} n(n-1)$. It is obvious
again that the number of places left vacant is $(n - 2)$ and that, not
allowing repetitions, they can be filled by the $(n - 2)$ remaining digits in
$(n-2)!$ different manners, so that the total number of inversions 1,0 is
$\frac{1}{2} n(n-1)(n-2)!$. In Matrix M it is seen that $\frac{1}{2} n(n-1)$
different inversions can be made from n digits. Hence, the total number of
inversions in the distribution is $\frac{1}{4} n^2 (n-1)^2 (n-2)!$, or
$\frac{1}{4} n(n-1)\, n!$. Since the number of permutations in the
distribution is $n!$, $M_i$, the average number of inversions per permutation,
is given by

$$M_i = \frac{n(n-1)}{4} \qquad (12)$$

The variance of the number of inversions has been given by Dantzig (5) as

$$\sigma_i^2 = \frac{n(n-1)(2n+5)}{72} \qquad (13)$$

He also found that the distribution of inversions becomes practically normal
for n > 6. This permits the application of this test of randomness to small
sets of digits individually. For instance, if n = 8, $M_i = 14$ and
$\sigma_i = 4.04$. The sets 87654321 and 12345678, with 28 and zero inversions
respectively, are not random arrangements because they deviate by about
3.5 σ from the mean; but the set 63548127, with 15 inversions, is consistent
with a random arrangement, as it diverges from the mean by only one inversion,
or about a quarter of σ.
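
The inversion counts in the examples, and the moments of the inversion count, can be checked by enumeration for small n. This is a sketch under stated assumptions: the helper names are illustrative, and the variance is compared with the standard closed form $n(n-1)(2n+5)/72$ for the number of inversions in a random permutation.

```python
from fractions import Fraction
from itertools import permutations

def inversions(seq):
    """Count pairs in which an earlier element is larger than a later one."""
    return sum(1 for i in range(len(seq))
               for j in range(i + 1, len(seq)) if seq[i] > seq[j])

def moments(n):
    """Exact mean and variance of the inversion count over all n! permutations."""
    counts = [inversions(p) for p in permutations(range(n))]
    mean = Fraction(sum(counts), len(counts))
    var = Fraction(sum(c * c for c in counts), len(counts)) - mean ** 2
    return mean, var

print(inversions("45132"), inversions("63548127"))
mean, var = moments(6)
print(mean, var)
print(Fraction(6 * 5, 4), Fraction(6 * 5 * 17, 72))   # closed forms for n = 6
```

Exact fractions are used so that the comparison with the closed forms does not depend on rounding.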


The formula for the number of inversions for the case when repetitions are
allowed has not been given. As in the preceding sections, and since the vacant
places may now be filled in $n^{n-2}$ ways, the total number of inversions in
the distribution is $\frac{1}{4} n^2 (n-1)^2 n^{n-2}$ or
$\frac{1}{4} n^n (n-1)^2$, and as the number of permutations in the
distribution is now $n^n$, the average number of inversions per permutation,
$M_{ir}$, if repetitions are allowed, is

$$M_{ir} = \frac{(n-1)^2}{4} \qquad (14)$$

Another problem that one may need to solve is that of finding the expected
number of inversions when sets of k digits are selected from n digits. When
k = n, the above formulas naturally apply. When k < n, the number of positions
for each inversion is $\frac{1}{2} k(k-1)$. The number of different kinds of
inversions is still $\frac{1}{2} n(n-1)$, and the $(k - 2)$ vacant places may
be filled by the $(n - 2)$ remaining digits in $(n-2)^{(k-2)}$ different ways,
so that the total number of inversions is
$\frac{1}{4} k(k-1)\, n(n-1)\, (n-2)^{(k-2)}$. Here the total number of
permutations is $n^{(k)}$ and the average number of inversions per
permutation, $M_{ik}$, no repetitions allowed, is

$$M_{ik} = \frac{k(k-1)}{4} \qquad (15)$$

where k < n. When k > n, repetitions are necessary, and now the $(k - 2)$
vacant places may be filled by all the n digits in $n^{k-2}$ ways, the total
number of inversions is $\frac{1}{4} k(k-1)\, n(n-1)\, n^{k-2}$, the total
number of permutations is $n^k$, and the average number of inversions per
permutation is

$$M_{ikr} = \frac{k(k-1)(n-1)}{4n} \qquad (16)$$
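
The averages for sets of k digits drawn from n digits can be checked the same way. The sketch below is illustrative; `mean_inversions` is a hypothetical helper, and it enumerates every set, so it is only suitable for small n and k.

```python
from fractions import Fraction
from itertools import permutations, product

def inversions(seq):
    """Count pairs in which an earlier element is larger than a later one."""
    return sum(1 for i in range(len(seq))
               for j in range(i + 1, len(seq)) if seq[i] > seq[j])

def mean_inversions(n, k, repetitions):
    """Exact average inversion count over all k-digit sets drawn from n digits."""
    if repetitions:
        sets = product(range(n), repeat=k)
    else:
        sets = permutations(range(n), k)
    counts = [inversions(s) for s in sets]
    return Fraction(sum(counts), len(counts))

print(mean_inversions(5, 3, repetitions=False))  # compare with k(k-1)/4
print(mean_inversions(3, 4, repetitions=True))   # compare with k(k-1)(n-1)/(4n)
print(mean_inversions(4, 4, repetitions=True))   # compare with (n-1)^2/4
```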


Another method proposed here is that of classifying the inversions by their
order. If the difference between the two numbers making an inversion is t,
call this an inversion of order t. That is, inversions 21, 51, and 42 are
respectively of the first, fourth, and second orders. As repetitions are not
allowed there will be no inversions of zero order. These would be runs of
length two when in adjacent positions, unless preceded or followed by more of
the same kind of digits, in which case their length would increase
correspondingly.

It was pointed out above that Matrix M shows the number of kinds of inversions
to be equal to $\frac{1}{2} n(n-1)$. It is equally obvious there that the
number of inversions of order t is $(n - t)$. Then from these and from formula
(12), the average number of inversions of order t per permutation is
$(n - t) \div \frac{1}{2} n(n-1) \times \frac{1}{4} n(n-1)$, or

$$M_t = \frac{n - t}{2} \qquad (17)$$

This value is to be multiplied by the number of permutations. For example, if
there are twenty 5-choice matching items, n = 5, and
$M_t = \frac{5 - t}{2}$. Multiplying this by 20, the expected number of
inversions of order t in 20 items is 50 - 10t, and the expected numbers of
inversions of order one, two, three, and four are respectively 40, 30, 20, and
10. The chi square test with three degrees of freedom may now be applied.
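
Formula (17) can be checked by tallying inversions by order over every permutation. This is a minimal sketch; the function name `mean_inversions_by_order` is illustrative.

```python
from fractions import Fraction
from itertools import permutations

def mean_inversions_by_order(n):
    """Average number of inversions of each order t over all permutations
    of the digits 0..n-1 (no repetitions)."""
    totals = {t: 0 for t in range(1, n)}
    perms = list(permutations(range(n)))
    for p in perms:
        for i in range(n):
            for j in range(i + 1, n):
                if p[i] > p[j]:
                    totals[p[i] - p[j]] += 1   # order = difference of the pair
    return {t: Fraction(total, len(perms)) for t, total in totals.items()}

means = mean_inversions_by_order(5)
expected_in_20_items = {t: 20 * m for t, m in means.items()}
print(expected_in_20_items)
```

The four expected frequencies are the ones that would be compared with the observed tallies in the chi square test.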

M. G. Kendall (8) and A. C. Rosander (14) have suggested the use of the method
of inversions for estimating the rank correlation coefficient. Since the
maximum number of inversions in a series of n digits is $\frac{1}{2} n(n-1)$,
the minimum number is zero, and the mean is just midway between these two
extremes, or at $\frac{1}{4} n(n-1)$, as given by formula (12), it may be said
that when two series have a perfect correlation of 1.00, the minimum number of
inversions is obtained; when the rank correlation is zero, the mean or
expected value is obtained; and when the correlation is -1.00, the maximum
number of inversions is obtained. If the number of observed inversions is x,
and the maximum number $x_m$,

$$r_i = 1 - \frac{2x}{x_m} = 1 - \frac{4x}{n(n-1)} \qquad (18)$$

This formula yields the values stated above. The procedure is similar to the
rank method. One of the two traits is ranked following the order of the first
n natural numbers, and the second trait follows in a consequential manner. As
a simple hypothetical illustration, take a case where n = 5.


The inversions are now counted from left to right, by counting the smaller
number following, instead of the larger number preceding, as the result is the
same. Here x = 5, and substituting in (18), $r_i = 0$. The variance is given
by (13), and for n > 6, the usual test of significance may be applied through
the normal probability tables.
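
Formula (18) is short enough to sketch directly. The function name `inversion_rank_correlation` and the ranking used in the last line are hypothetical; the ranking was chosen only because it has five inversions, and it is not the illustration from the original article.

```python
def inversions(seq):
    """Count pairs in which an earlier element is larger than a later one."""
    return sum(1 for i in range(len(seq))
               for j in range(i + 1, len(seq)) if seq[i] > seq[j])

def inversion_rank_correlation(second_ranks):
    """Formula (18): r = 1 - 4x / (n(n-1)), where second_ranks lists the ranks
    on the second trait in the order of the ranks on the first trait."""
    n = len(second_ranks)
    x = inversions(second_ranks)
    return 1 - 4 * x / (n * (n - 1))

print(inversion_rank_correlation([1, 2, 3, 4, 5]))   # perfect agreement
print(inversion_rank_correlation([5, 4, 3, 2, 1]))   # perfect disagreement
print(inversion_rank_correlation([3, 5, 1, 2, 4]))   # hypothetical ranking, x = 5
```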


12. PHASES 


This test has been used in time series in economics by W. A. Wallis and
G. H. Moore (17)(18). In a series of numbers, all different, the point at
which the series either ceases to rise and begins to decline, or ceases to
decline and begins to rise, is called a turning point. The former is a “peak”
or maximum, and the latter a “trough” or minimum. The interval between these
two is a phase, which is called an “expansion” if the peak follows the trough,
and a “contraction” if the contrary is true. In this test the incomplete phase
before the first turning point and the one following the last turning point
are ignored. In a series of n numbers or ranks the number of turning points
ranges from zero to $(n - 2)$ and the maximum duration of a phase is
$(n - 3)$. The “duration” of a phase is the number of intervals in it. For
example, in the series 1274890563 there are the five turning points at 7, 4,
9, 0, and 6, of which 7, 9, and 6 are peaks, while 4 and 0 are troughs. The
incomplete phase at the beginning, 127, and that at the end, 63, are not
counted. The four phases that are counted are: 74 of a duration of one, which
is a contraction; 489 of a duration of two, which is an expansion; 90 of a
duration of one, which is a contraction; and 056 of a duration of two, which
is an expansion. Since the size of the difference between consecutive digits
is not considered, it is convenient to write the signs of the differences
between the successive digits instead of the series itself. Thus, the above
series may be represented by + + - + + - + + -. The number of signs is
$(n - 1)$.
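
The decomposition of a series into complete phases can be sketched as follows. The function names are illustrative and not taken from Wallis and Moore; the code assumes that no two adjacent values are equal.

```python
def signs(series):
    """Signs of the successive differences (assumes adjacent values differ)."""
    return ["+" if b > a else "-" for a, b in zip(series, series[1:])]

def complete_phases(series):
    """Return (kind, duration) for every phase bounded by two turning points."""
    runs = []
    for sign in signs(series):
        if runs and runs[-1][0] == sign:
            runs[-1][1] += 1          # the current phase continues
        else:
            runs.append([sign, 1])    # a turning point starts a new phase
    # The first and last runs are the incomplete phases and are dropped.
    return [("expansion" if sign == "+" else "contraction", length)
            for sign, length in runs[1:-1]]

series = [1, 2, 7, 4, 8, 9, 0, 5, 6, 3]
print("".join(signs(series)))
print(complete_phases(series))
```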

Wallis and Moore (17) have derived the following formulas for the expected
number of complete phases of a duration of d, $M_d$, and for the mean duration
of a phase, $M_D$:

$$M_d = \frac{2(d^2 + 3d + 1)(n - d - 2)}{(d + 3)!} \qquad (19)$$

$$M_D = \frac{3n - 11.6194}{2n - 7} \qquad (20)$$
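
The expected number of complete phases of duration d can be compared with an exact average over all permutations of a short series. This is a sketch only, with illustrative function names, and it assumes the formula is read as $2(d^2 + 3d + 1)(n - d - 2)/(d + 3)!$.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def expected_phases(n, d):
    """Expected number of complete phases of duration d, as in formula (19)."""
    return Fraction(2 * (d * d + 3 * d + 1) * (n - d - 2), factorial(d + 3))

def observed_mean_phases(n, d):
    """Average count of complete phases of duration d over all n! permutations."""
    total = 0
    for p in permutations(range(n)):
        runs = []
        for a, b in zip(p, p[1:]):
            rising = b > a
            if runs and runs[-1][0] == rising:
                runs[-1][1] += 1
            else:
                runs.append([rising, 1])
        # Drop the incomplete first and last phases before counting.
        total += sum(1 for _, length in runs[1:-1] if length == d)
    return Fraction(total, factorial(n))

for d in (1, 2, 3):
    print(d, expected_phases(6, d), observed_mean_phases(6, d))
```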


Three classes may be obtained for n>6. 
Those of a duration of one, those of a dura- 
tion of two, and those of a duration of three 
or longer. As the distribution of this statistic 
diverges a little from Pearson’s chi square, 
the authors have made a probability table of 
it for values of n from six to twelve. The test
may be used for small sets of digits of size 
six or larger, such as are found in matching 
tests, no repetitions being allowed. 


Another method given by Wallis and Moore (17) is through the use of the normal
probability tables, as the total number of completed phases is normally
distributed, with a mean equal to $\frac{1}{3}(2n - 7)$ and a variance of
$\frac{1}{90}(16n - 29)$. But here, as the series is not actually continuous,
.5 must be subtracted from the absolute difference between the observed and
the expected value to correct for discontinuity.

This is another method that may also be 
used as a substitute for rank correlation, by 
first arranging one variate in ascending order, 
and then tabulating the phase duration of the 
resulting arrangement of the correlated vari- 
ate. Here, a non-significant chi square will 
mean that the variates are independent.


13. EXPANSIONS AND CONTRACTIONS


This is another test given by Wallis and Moore (18), and it is an over-all
test of the duration of the expansions and contractions in a series. In the
example given in the preceding section, the series 1274890563 is represented
by + + - + + - + + -. The + signs represent expansions, and their number is
the duration of the expansion. Likewise, the - signs are contractions, their
duration being measured by the number of such signs. As there are $(n - 1)$
signs, the expected number of signs of either kind is $\frac{1}{2}(n - 1)$ and
their variance is $\frac{1}{12}(n + 1)$, the numbers being all different. For
large values of n probabilities may be had from the normal curve tables. For
values of n from two to twelve the distribution of the discrepancies between
observed and expected values is tabulated by the authors. It is evident that
this test may be applied to the scoring key 
of a true-false test. It may also be used as a 
rank correlation method, arranging first one 
variable in the usual manner, and tabulating 
the signs of the differences from the second 
variable so as to form a series of signs as the 
one shown in the above illustration. 


14. RUNS OF TWO KINDS


This method, described by F. S. Swed and C. Eisenhart (15), is used for
testing whether two samples belong to the same population. When m objects of
one kind and n objects of another kind are arranged along a line, there are
runs of either kind in the arrangement. The total number of runs of both kinds
and of all lengths is usually designated as u. Thus, in the arrangement
aabbbab, m = 3, n = 4, and u = 4. Let $n \ge m$. The number of different
arrangements possible is $\frac{(m+n)!}{m!\,n!}$. Then $P[u \le u']$, or the
probability of obtaining a number of runs u no greater than a given value
$u'$, is

$$P[u \le u'] = \frac{m!\,n!}{(m+n)!} \sum_{u=2}^{u'} f_u \qquad (21)$$

where

$$f_u = 2 \cdot \frac{(m-1)!}{(k-1)!\,(m-k)!} \cdot \frac{(n-1)!}{(k-1)!\,(n-k)!}$$

when $u = 2k$, and

$$f_u = \frac{(m-1)!}{(k-1)!\,(m-k)!} \cdot \frac{(n-1)!}{(k-2)!\,(n-k+1)!}
      + \frac{(m-1)!}{(k-2)!\,(m-k+1)!} \cdot \frac{(n-1)!}{(k-1)!\,(n-k)!}$$

when $u = 2k - 1$, for k = 1, 2, 3, ..., (m + 1).

The authors give tables for the values of $P[u \le u']$ for
$m \le n \le 20$, for m from two to 20. This is the fourth one of the methods
described that may be used as a substitute for the rank correlation
coefficient. It may be
tried for testing the reliability of measure- 
ments, that is, for testing the null hypothesis 
that two sets of measurements “belong to the 
same population”, or in other words, that it 
is the same trait that is being measured in 
both instances. Suppose that two measure- 
ments are taken on each of five things A, B, 
C, D, and E, with the results shown below. 


A B C D E
25 28 23 35 29
22 30 27

If now the ten measurements are arranged in order of magnitude, and the second
measurements are distinguished by being included in parentheses, the following
series is obtained:

(16) (22) 23 (24) 25 (27) 28 29 (30) 35

Here m = 5, n = 5, and u, the total number of runs, or the total number of
distinct groups, is equal to eight. Swed and Eisenhart's tables (15) give a
value of .9603175 for $P[u \le u']$, and therefore the null hypothesis is not
rejected. For perfect reliability, the maximum value of P or 1.00 is required.
This method may cause some trouble when two or more of the measurements are
equal.
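
The tabled probability can be recomputed from the run-count formula with binomial coefficients. This is a hedged sketch rather than Swed and Eisenhart's own procedure; the function names are illustrative, and the label string marks first measurements as "a" and second (parenthesized) measurements as "b".

```python
from fractions import Fraction
from math import comb

def f_u(m, n, u):
    """Number of arrangements of m a's and n b's having exactly u runs (u >= 2)."""
    if u % 2 == 0:
        k = u // 2
        return 2 * comb(m - 1, k - 1) * comb(n - 1, k - 1)
    k = (u + 1) // 2
    return (comb(m - 1, k - 1) * comb(n - 1, k - 2)
            + comb(m - 1, k - 2) * comb(n - 1, k - 1))

def prob_runs_at_most(m, n, u_max):
    """P[u <= u_max] when all arrangements are equally likely."""
    total = comb(m + n, m)
    return Fraction(sum(f_u(m, n, u) for u in range(2, u_max + 1)), total)

def count_runs(labels):
    """Total number of runs in a sequence of labels."""
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)

ordered = "bbababaaba"   # (16) (22) 23 (24) 25 (27) 28 29 (30) 35
u = count_runs(ordered)
p = prob_runs_at_most(5, 5, u)
print(u, p, float(p))
```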


15. RUNS ABOVE AND BELOW THE MEDIAN


This method is given by F. Mosteller (13) as a modification of the method of
runs. Take a series of 2n observed magnitudes in terms of numbers or digits.
If the 2n cases are arranged in ascending order, the nth and the (n + 1)th
terms are the middle terms. (If there are (2n + 1) cases the middle term is
ignored by this method.) Any case $x_i$ is called a if it does not exceed the
nth term, and b if it is at least as large as the (n + 1)th term. Now, in the
original or observed arrangement, a run of a's is called a “run below the
median”, and a run of b's is called a “run above the median”.

The article (13) gives a formula for estimating the probability of obtaining
at least one run of either a's or b's alone of any given length. The formulas
are long and include several summations and several binomial coefficients. Two
tables are given, the first one including the minimum length of runs on either
side of the median, and on only one side of the median, for values of 2n from
10 to 50 for the significance levels .05 and .01; and the other giving the
probability of getting at least one run of size one to 13 in samples where
2n = 10, 2n = 20, and 2n = 40, on one side, on either side, and on both sides
of the median. This method has been found to be efficient in quality control
of mass production for the purpose of indicating the existence of assignable
causes.
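
The labelling step and the longest run on each side of the median can be sketched as follows. The function names and the ten-value series are hypothetical, all values are assumed distinct, and the code only finds the run lengths; the significance levels still have to be read from Mosteller's tables.

```python
def median_labels(values):
    """Label each value 'a' (lower half) or 'b' (upper half); the middle
    value of an odd-length series is dropped. Assumes distinct values."""
    ordered = sorted(values)
    half = len(values) // 2
    low, high = ordered[half - 1], ordered[-half]
    return ["a" if v <= low else "b" for v in values if v <= low or v >= high]

def longest_runs(labels):
    """Length of the longest run of a's and of b's in the observed order."""
    best = {"a": 0, "b": 0}
    current, length = None, 0
    for lab in labels:
        length = length + 1 if lab == current else 1
        current = lab
        best[lab] = max(best[lab], length)
    return best

observed = [3, 1, 4, 9, 8, 7, 2, 6, 5, 0]   # hypothetical series, 2n = 10
labels = median_labels(observed)
print("".join(labels), longest_runs(labels))
```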


16. OTHER METHODS


Kendall and Smith (8) have made some 
other tests besides the Poker Test described 
in Section 10. A frequency test is proposed 
to ascertain whether the ten digits appear 
with about equal frequency, a serial test 
which consists in correlating pairs of consecu- 
tive digits, and a gap test which is essentially 
a run test for two classes, zero and not zero. 
Yule (20) has made a summation test de- 
signed to lead to an approximately normal 
distribution, which is then tested. W. O. Ker- 
mack and A. G. McKendrick (10) have pro- 
posed a periodicity test, and Wallis and 
Moore (19) a regularity of pattern test.


DISTINGUISHING METHOD DIFFERENCES BY USE
OF DISCRIMINANT FUNCTIONS*


WILLIAM DOWELL BATEN¹ and HAZEL M. HATCHER²
Michigan State College
Agricultural Experiment Station
East Lansing, Michigan


As part of an experimental study in home 
economics (1), made to determine the rela- 
tive effectiveness at the secondary level of 
two methods of instruction, two groups of 
students were taught a unit in meal planning 
and preparation by the same instructor. One 
method, referred to as the control method, 
was wholly directed by the teacher, who also 
determined the objectives, planned the proce- 
dures to be followed, and evaluated the 
pupils’ achievement. In the experimental 
method, the teacher and pupils together de- 
termined the goals they wished to reach, de- 
cided how best to work toward these goals, 
and together checked accomplishment as the 
unit progressed. There was no statistically 
significant difference between the two groups 
on five variables, namely, IQ, pre-test score, 
socio-economic level, as determined by the 
father’s occupation, grade level, and age. 

The final evaluation for each student was 
in terms of the following: 


(1) A pencil-and-paper test which had a coefficient of reliability of .95
    for the five hundred or more senior high school pupils in foods classes
    participating in the experiment;

(2) Food score cards which had been shown to have coefficients of
    objectivity of approximately .90 in an earlier study; and

(3) Check lists for meal preparation and serving similar to devices which
    had been found to have coefficients of objectivity of .90 or higher,
    when used by trained raters in other investigations at the University
    of Minnesota.


R. A. Fisher (2) developed the discriminant function for comparing linear
compounds made up of several variables. This function enables one to test the
significance of a difference between the averages of two compounds each made
up of two or more variables. Literature pertaining to discriminant functions
can be found in references (3), (4), (5), (6).

* Journal Article 655, new series, Mich. Agr. Expt. Sta.
¹ Research associate, Mich. Agr. Expt. Station.
² Associate Professor of Education.

Let x, y and z represent the scores of the students taught by the experimental
method on the final test, the products prepared (food score card), and certain
observed abilities in meal preparation and serving (check list). Let x', y'
and z' represent similar quantities pertaining to students taught by the
control method. Let the linear compound of these scores pertaining to the
former method be

$$X = ax + by + cz$$

where the constants a, b and c are to be found; let the linear compound of
these quantities pertaining to the latter method be

$$X' = ax' + by' + cz'$$

where the constants a, b and c are the same as in the compounds pertaining to
the experimental method. Let the difference between the means of these
compounds be

$$D = a d_1 + b d_2 + c d_3 \qquad (1)$$

where $d_1 = \bar{x} - \bar{x}'$, $d_2 = \bar{y} - \bar{y}'$,
$d_3 = \bar{z} - \bar{z}'$, and $\bar{x}$, $\bar{y}$, $\bar{z}$ represent
respectively the means of the scores of the students in the experimental group
made on the final test, on products prepared, and on observed abilities in
meal preparation and serving; and $\bar{x}'$, $\bar{y}'$, $\bar{z}'$ represent
respectively the corresponding means pertaining to the control group. Let

$$S = a^2 S_{11} + b^2 S_{22} + c^2 S_{33} + 2ab S_{12} + 2ac S_{13} + 2bc S_{23},$$

where

$$S_{11} = \sum x^2 - (\sum x)^2/n + \sum x'^2 - (\sum x')^2/m,$$
$$S_{22} = \sum y^2 - (\sum y)^2/n + \sum y'^2 - (\sum y')^2/m,$$
$$S_{33} = \sum z^2 - (\sum z)^2/n + \sum z'^2 - (\sum z')^2/m,$$
$$S_{12} = \sum xy - (\sum x)(\sum y)/n + \sum x'y' - (\sum x')(\sum y')/m,$$
$$S_{13} = \sum xz - (\sum x)(\sum z)/n + \sum x'z' - (\sum x')(\sum z')/m,$$
$$S_{23} = \sum yz - (\sum y)(\sum z)/n + \sum y'z' - (\sum y')(\sum z')/m,$$

where n and m are the numbers of students in the respective groups. The
quantities $S_{11}$, $S_{22}$, $S_{33}$ are the within sums of squares and the
quantities $S_{12}$, $S_{13}$, $S_{23}$ are the within sums of products of the
scores. The quantity S is the within sum of squares pertaining to the
compounds X and X'.

By maximizing the ratio $D^2/S$ the following equations arise:

$$\begin{aligned}
a S_{11} + b S_{12} + c S_{13} &= d_1, \\
a S_{12} + b S_{22} + c S_{23} &= d_2, \\
a S_{13} + b S_{23} + c S_{33} &= d_3,
\end{aligned} \qquad (2)$$

from which the constants a, b and c can be found.

The compound D is the one linear compound of these 3 sets of scores, among all
possible linear compounds, which discriminates most between one method of
teaching and the other, as far as these three sets of scores are concerned.

From the data for the two groups measured, equations (2) become:

$$\begin{aligned}
3{,}219.46a + 291.48b + 609.18c &= 2.39, \\
291.48a + 1{,}794.61b + 884.30c &= -2.13, \\
609.18a + 884.30b + 3{,}204.61c &= 23.47;
\end{aligned}$$

from which a = -0.000449, b = -0.005515, c = 0.008931.

The compound X is

$$X = -0.000449x - 0.005515y + 0.008931z.$$

The means pertaining to X and X' are respectively

$$\bar{X} = -0.000449(59.56) - 0.005515(68.00) + 0.008931(80.04) = +0.313,$$
$$\bar{X}' = -0.000449(57.17) - 0.005515(70.13) + 0.008931(56.57) = +0.093.$$

The difference between these means is $\bar{X} - \bar{X}' = 0.220$. This
difference can be found directly from D as follows:

$$\begin{aligned}
D &= -0.000449(2.39) - 0.005515(-2.13) + 0.008931(23.47) \\
  &= -0.001073 + 0.011745 + 0.209611 \\
  &= 0.22,
\end{aligned} \qquad (3)$$

as before.
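
The constants a, b and c can be recovered by solving the three normal equations numerically. The sketch below uses a small hand-written elimination routine (an assumption made here so that the example needs no external library); because the printed coefficients are rounded, the solution agrees with the published constants only to about five decimal places.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Within sums of squares and products, and mean differences, as printed.
S = [[3219.46, 291.48, 609.18],
     [291.48, 1794.61, 884.30],
     [609.18, 884.30, 3204.61]]
d = [2.39, -2.13, 23.47]

a, b, c = solve_linear_system(S, d)
terms = [a * d[0], b * d[1], c * d[2]]   # the three terms of D
D = sum(terms)
print(round(a, 6), round(b, 6), round(c, 6), round(D, 3))
```

The list `terms` holds the three contributions to D, which are the quantities compared when the scores are ranked by importance.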


The analysis of variance pertaining to the compounds X and X' is given in
Table 1. The asterisks in the last column indicate that the linear compound
means made up of the three sets of scores made by the students are
significantly different. This means that the linear compound, made up of
scores of students in home economics pertaining to the experimental method, is
quite different from a similar linear compound pertaining to the control
method of teaching. If the X and X' compound scores represent measures of
ability or competence at the end of the course, then the amount gained from
one method of teaching is greater than the amount gained from the other
(assuming the adequacy of the initial matching).

TABLE 1
ANALYSIS OF VARIANCE OF THE LINEAR COMPOUNDS PERTAINING TO SCORES
FROM THE TWO METHODS OF TEACHING

The sizes of the terms in equation (3) 
show which of the scores (final test, products 
prepared, or observed abilities in meal prepa- 
ration) is most important, next most impor- 
tant, etc. for distinguishing method difference. 
The last term, 0.209611, is the greatest in 
absolute value; hence the score based on 
observed abilities in meal preparation and 
serving is the most important. The second 
term, 0.011745, is the next in absolute value; 
hence the score on products prepared is next 
most important; and the final test grade is 
least in importance for determining a differ- 
ence between these two methods of teaching 
home economics. These factors may have 
different ranks for other pairs of methods. 

A similar analysis of variance of scores 
from two other classes taught by similar 
methods shows that the experimental method 
of teaching secures results significantly supe- 
rior to the other. The difference between the 
compound means is 


$$D = 0.01408 - 0.00130 + 0.05430 = 0.06708. \qquad (4)$$


The most important score in this case for 
distinguishing a method difference is score on 
ratings of the observed abilities involved in 
meal preparation; the next in importance is 
the final test score; the last in importance is
the score on products prepared. The ranks of
importance of these scores are not the same 
as in the first case considered. These ranks 
might be different for another teacher or for 
another group of students. The linear com- 
pound or discriminant function enables one, 
for the data observed, to determine whether 
or not there is a difference, as judged by the 
three types of results, between the two meth- 
ods of teaching; it also enables one to ascer- 
tain the relative importance of these three 
scores. 

For both groups the value of D was posi- 
tive; hence the experimental method appears 
to be superior to the other as far as teaching 
this material to students is concerned. 

The discriminant function can be used 
effectively to compare compounds made up of 
several sets of observations. 
A STUDY OF THE RELATIONSHIP BETWEEN GRADE 
AND AGE AND VARIABILITY 


FEI TSAO 
University of Toronto 


I. INTRODUCTION 


Within the field of individual differences it 
is widely accepted as common sense that vari- 
ability in every mental function is increased 
by age and by grade. This conclusion, for 
example, is accepted and emphasized by the 
late Professor Sandiford(17, p. 197), as fol- 
lows: 

“. . . When we turn to a similar problem, 
namely, the effect of age on variability, the 
results are unequivocal. Age increases vari- 
ability. This means that babies are more 
alike than public school pupils, public 
school pupils than high school pupils, and 
high school pupils than adults. B.A.’s are 
more alike than M.A.’s or Ph.D.’s . . .” 


Disregarding any philosophical backgrounds, 
one of the leading concepts in the current 
educational system, namely that the child 
must first receive general training, is based 
upon this common sense principle. 

From the statistical point of view, however, 
the advanced statistical techniques such as 
analysis of variance(7, pp. 34-42) and 
factor-analysis¹ are based upon an assump- 
tion that all subclasses tested should have 
similar estimates of variability. If we find 
that variability increases with age or grade, 
then any advanced statistical analysis of psy- 
chological or educational data will become 
impossible. So the present problem becomes 
important not only because it links with an 
educational theory but also because it chal- 
lenges the recently developed statistical tech- 
niques. 

The writer uses some new statistical tech- 
niques in order to attack this problem more 
impartially. A special examination of some 

¹ . . . have pointed out . . . of tests or sub-tests . . . played . . . eventually has an important influence on the estimates of correlation. . . . variability is different for . . . factorial analysis is based, will be higher than . . . , and hence one cannot clearly identify any significant psychological factor. 


previous studies is made by using these tech- 
niques. Finally, an experimental study is 
carried out and the results analyzed by the 
same techniques. Throughout the study the 
writer has endeavoured to reach some con- 
sistent conclusions by attacking this problem 
as objectively as possible. 


II. PROBLEMS OF STATISTICAL TECHNIQUES 

Neyman-Pearson’s $L_1$ Test 


Early in 1931, Neyman and Pearson developed the well-known $L_1$ (likelihood criterion 1) test, i.e. the test of a hypothesis that they called: 

$$H_1:\ \sigma_s = \sigma \tag{1}$$

This is the hypothesis that the samples have been drawn from populations having the same constant standard deviation $\sigma$. 

In 1936, Nayer(11) presented a paper in which he (a) considered the . . . of the $L_1$ test; (b) provided tables(ibid., p. 51) of the 5 per cent and 1 per cent probability levels for $L_1$, where the different samples are of equal size; and (c) considered how far, in the case where the samples are of unequal size, the probability levels for $L_1$ might be obtained from his tables by entering them with the average sample size. His results were satisfactory. Welch(25), in the meantime, suggested an adaptation of the $L_1$ test, using the following equation: 


$$L_1 = \frac{N \prod_s \left( \theta'_s / n_s \right)^{n_s/N}}{\sum_s \theta'_s} \tag{2}$$

where $s = 1, 2, \ldots, k$; $k$ denotes the number of samples; $n_s$ denotes the number of individuals within the $s$-th group; $N$ denotes the number of individuals in all the samples; $\theta'_s$ is the sum of squares of the errors or the residuals of the $s$-th sample; $\prod$ denotes product; and $\sum$ denotes summation. 

The $L_1$ test is very useful for testing whether there are significant differences in variability among several groups, especially small groups. As our problem is mostly concerned with the relationship between variability and age or grade, this test will enable us to determine whether or not there are actual differences in variability, and hence determine the significance of apparent relationships. 

Let us illustrate the use of the $L_1$ test. Table I gives the data for the Dominion Group Test of Intelligence (Junior), Form B, obtained by the Department of Educational Research, University of Toronto, in a certain school. In this table, $M_s$ is the mean score of the $s$-th grade; $s'_s$ is an unbiased estimate of standard deviation for the $s$-th grade, which equals $\sqrt{\theta'_s/(n_s - 1)}$; and $n_s$ and $\theta'_s$ are defined as earlier. 

TABLE I 
THE DATA FOR THE DOMINION INTELLIGENCE TEST (JUNIOR) IN A CERTAIN SCHOOL 

$M_s$      $\theta'_s$ 
37.82      3662.9091 
45.31      2935.6389 
58.76      2358.1176 

To find the value of $L_1$, we first calculate the value of $\log L_1$, which comes from formula (2) 

$$\log L_1 = \log N - \frac{1}{N}\sum_s n_s \log n_s + \frac{1}{N}\sum_s n_s \log \theta'_s - \log\Big(\sum_s \theta'_s\Big) \tag{3}$$

and then find $L_1$ from a table of antilogarithms. 

TABLE II 
CALCULATION OF LOG $L_1$ FOR THE DATA FROM TABLE I 

$f_s$    $n_s$    $\theta'_s$    $\log \theta'_s$ 
32       33       3662.9091      3.5638 
35       36       2935.6389      3.4677 
33       34       2358.1176      3.3726 
100      103      8956.6656 

$\sum_s n_s \log n_s = 158.2083$;  $\sum_s n_s \log \theta'_s = 357.1110$ 

Where $f_s = n_s - 1$ denotes the degrees of freedom for $\theta'_s$. 

For the present example, we find $L_1 = .981$, $k = 3$, and $\bar{f} = \sum_s f_s/3 = 33.33$. The hypothesis we wish to test is $H_1$: $\sigma_s = \sigma$, i.e. there are no significant differences among these three grades in variability. Referring to Nayer’s tables of the $L_1$ distribution with $k = 3$ and $\bar{f} = 33.33$, we see that the calculated $L_1$ is greater than the 5 per cent point. The rules to be followed in using these tables are: 

(a) to reject the hypothesis when the $L_1$ is less than the corresponding 1 per cent point; 

(b) to accept the hypothesis tested when the calculated $L_1$ is greater than the corresponding 5 per cent point; 

(c) to remain in doubt when the $L_1$ lies between the corresponding 5 per cent and 1 per cent points. 

So in our case we accept the hypothesis $H_1$, and we conclude that these three grades are homogeneous with regard to variability. 

The technique of the $L_1$ test is fairly effective in testing the significance of the differences in variability among different groups. There are, however, at least two demerits in this test: 

(a) The range of the $L_1$ value is only from 0 to 1. Therefore the test will not be very sensitive. 

(b) When the value of $\bar{f}$ is larger than 50, the process of the interpolation between 50 and $\infty$ will certainly involve many errors. So the test is not very useful for groups of larger size. 

In view of these two demerits, another technique, which is known as Bartlett’s test of homogeneity of variance, should be considered. This will be presented in the next section. 
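
The logarithmic computation of the likelihood criterion described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `l1_criterion` is hypothetical, and it assumes the criterion is $\log L_1 = \log N - \frac{1}{N}\sum n_s \log n_s + \frac{1}{N}\sum n_s \log \theta'_s - \log \sum \theta'_s$, with group sizes and within-group sums of squares taken from the three-grade example above.

```python
import math


def l1_criterion(n, theta):
    """Likelihood criterion L1 from group sizes n_s and the
    within-group sums of squares theta_s (assumed formula, see lead-in)."""
    N = sum(n)
    log_l1 = (
        math.log10(N)
        - sum(ns * math.log10(ns) for ns in n) / N
        + sum(ns * math.log10(ts) for ns, ts in zip(n, theta)) / N
        - math.log10(sum(theta))
    )
    return 10 ** log_l1  # antilogarithm


# Group sizes and sums of squares for the three grades of the example.
n = [33, 36, 34]
theta = [3662.9091, 2935.6389, 2358.1176]
l1 = l1_criterion(n, theta)
print(round(l1, 3))  # prints 0.981
```

The value still has to be compared with tabled 5 per cent and 1 per cent points, which the sketch does not include. When every group has the same variance estimate the criterion equals 1, its upper limit.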

Bartlett’s Test of Homogeneity of Variance 

Bartlett(2) uses the $\chi^2$ distribution as an approximate test of the homogeneity of several estimates of variance. He found that the quantity 

$$\chi^2 = \frac{1}{C}\Big(F \log_e s'^2 - \sum_s f_s \log_e s_s'^2\Big) = \frac{2.3026}{C}\Big(F \log_{10} s'^2 - \sum_s f_s \log_{10} s_s'^2\Big) \tag{4}$$

where $s = 1, 2, \ldots, k$; $s_s'^2$ denotes the estimate of variance in the $s$-th group; $f_s$ denotes the corresponding number of degrees of freedom; $k$ denotes the number of groups; and also 

$$F = \sum_s f_s$$

$$s'^2 = \frac{1}{F}\sum_s f_s s_s'^2 = \frac{1}{F}\sum_s \theta'_s \quad (\theta'_s \text{ is defined as earlier})$$

$$C = 1 + \frac{1}{3(k-1)}\Big(\sum_s \frac{1}{f_s} - \frac{1}{F}\Big)$$

is approximately distributed as $\chi^2$ with $k - 1$ degrees of freedom, and exceedingly large values of $\chi^2$ will indicate the presence of significant differences among the several estimates of variance. We use Fisher’s table of the $\chi^2$ distribution(4, pp. 110-111) as a basis for testing the homogeneity of the estimates of variance. There will be discrepancies existing among the estimates of variance if $\chi^2$ lies beyond the corresponding 1 per cent point; the estimates of variance may be regarded as homogeneous if $\chi^2$ is less than the corresponding 5 per cent point; and no conclusion will be made if $\chi^2$ lies between the corresponding 5 per cent and 1 per cent points. 

In regard to our problem, the estimates of variance are generally used to denote the different degrees of variability. So we can use this technique to test $H_1$: $\sigma_s = \sigma$ as well. 

For example, if we wish to test the homogeneity in variability of the three grades given in Table I, the method to be used in calculating the value of $\chi^2$, using $s_s'^2$, is as shown in Table III. 
We find 

$$s'^2 = 8956.6656/100 = 89.5667$$

$$F \log_e s'^2 = 100(4.49499) = 449.49900$$

$$C = 1.02671$$

$$\chi^2 = (449.49900 - 448.61191)/1.0267 = .864$$

In using Fisher’s table of $\chi^2$, the probability of a greater value than this, for 2 degrees of freedom, is between the 50 per cent and 70 per cent points, so we accept the hypothesis $\sigma_s = \sigma$, i.e. we conclude that the three grades may be regarded as homogeneous in variability. 

The results obtained by using Bartlett’s test agree with those from the $L_1$ test. In some cases, however, they may not agree with each other. Since the range of the $\chi^2$ distribution is very large (from 0 to $\infty$), and the number of cases in each group plays no important part in the test of significance of $\chi^2$, Bartlett’s test will be more sensitive than the $L_1$ test, especially for groups of larger size. It will be wise, therefore, to use both of these tests throughout our study as a check on the accuracy of the results. 
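
Bartlett’s quantity can likewise be sketched in Python. The sketch below is an assumption-laden illustration, not the writer’s own computing routine: the function name `bartlett_chi2` is hypothetical, the correction factor is taken in its usual form $C = 1 + \frac{1}{3(k-1)}\big(\sum 1/f_s - 1/F\big)$, and the two variance estimates in the usage lines are made-up numbers chosen only to exercise the function.

```python
import math


def bartlett_chi2(variances, dfs):
    """Return (chi-square, degrees of freedom) for Bartlett's test,
    given unbiased variance estimates and their degrees of freedom."""
    k = len(variances)
    F = sum(dfs)
    # Pooled estimate of variance, weighted by degrees of freedom.
    pooled = sum(f * v for f, v in zip(dfs, variances)) / F
    # Usual form of the correction factor (assumed here).
    C = 1 + (sum(1 / f for f in dfs) - 1 / F) / (3 * (k - 1))
    stat = (
        F * math.log(pooled)
        - sum(f * math.log(v) for f, v in zip(dfs, variances))
    ) / C
    return stat, k - 1


# Hypothetical illustration: two groups with 10 degrees of freedom each.
stat, df = bartlett_chi2([1.0, 4.0], [10, 10])
print(round(stat, 4), df)
```

The statistic is zero when all the variance estimates coincide and grows without limit as they diverge, which is the property the comparison with the $L_1$ test turns on.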


Analysis of Variance for Unequal 
Subclass Numbers 


Snedecor(19; 18, pp. 235-240) has devel- 
oped several methods dealing with the analy- 
sis of variance for unequal or disproportionate 
subclass numbers in the field of agriculture 
and biology. None of his methods was very 
satisfactory, so he inclined to assume that 
disproportionate subclass numbers must be 
looked upon as a fault of experimental design 
to be avoided if possible (18, p. 239). In the 
field of education or psychology, however, the 
subclasses tested nearly always consist of un- 
equal or disproportionate numbers of subjects. 
The writer(24) has developed a technique which is different from Snedecor’s and seems to be more suitable for use in the situation which generally exists in the field of education or psychology. This new technique can be used in our problem in order to detect some selective factors involved in different schools or different grades in our experimental study which will be shown later. 


TABLE III 
ILLUSTRATION OF THE METHOD TO BE USED IN CALCULATING $\chi^2$ 
(Value of $s_s'^2$ From Table I) 

$s_s'^2$      $\theta'_s$     $\log_e s_s'^2$ 
114.4659      3662.9091       4.74027 
 83.8754      2935.6389       4.42934 
 73.6912      2358.1176 
              8956.6656 


Hoyt’s Procedure for the Testing of Variability Affected by Test Materials 

Hoyt(6) developed a procedure for measuring the variances between items and between individuals. The variance between items links with the heterogeneity of the respective difficulties of the items, while the variance between individuals is a measure of the individual differences between subjects. A reliable test must have a very large value of the variance between individuals in comparison with that of the residual variance. 

Assume that $X_{si}$ ($s = 1, 2, \ldots, k$; $i = 1, 2, \ldots, n$; $k$ denotes the number of items; and $n$ denotes the number of individuals) is defined as the score of the $i$-th individual on the $s$-th item, which is presumably 1 or 0. Define also 

$$\bar{X}_{..} = \frac{\sum_s \sum_i X_{si}}{N}, \quad \text{where } N = kn$$

$$\bar{X}_{s.} = \frac{\sum_i X_{si}}{n}$$

$$\bar{X}_{.i} = \frac{\sum_s X_{si}}{k}$$

Then the sum of squares between items is 

$$n \sum_s (\bar{X}_{s.} - \bar{X}_{..})^2 = \frac{\sum_s \big(\sum_i X_{si}\big)^2}{n} - \frac{\big(\sum_s \sum_i X_{si}\big)^2}{N} \tag{7}$$

and the sum of squares between individuals is 

$$k \sum_i (\bar{X}_{.i} - \bar{X}_{..})^2 = \frac{\sum_i \big(\sum_s X_{si}\big)^2}{k} - \frac{\big(\sum_s \sum_i X_{si}\big)^2}{N} \tag{8}$$

Since 

$$X_{si} = 1 \text{ or } 0 \tag{9}$$

Therefore 

$$X_{si} = X_{si}^2 \tag{10}$$

And the total sum of squares is 

$$\sum_s \sum_i (X_{si} - \bar{X}_{..})^2 = \frac{\big(\sum_s \sum_i X_{si}\big)\big(N - \sum_s \sum_i X_{si}\big)}{N} = \frac{n_1 n_0}{N} \tag{11}$$

where $n_1 = \sum_s \sum_i X_{si}$, i.e. the number of correct responses of all the subjects on all the items, and $n_0$ is the number of incorrect responses. 

Again, if we define 

$$p = \frac{n_1}{N} \tag{12}$$

which is synonymous with $\bar{X}_{..}$, i.e. actually the ratio of the ordinary mean score for this group of subjects to the total number of items.¹ And also 

$$q = 1 - p \tag{13}$$

It follows that 

$$q = \frac{n_0}{N} \tag{14}$$

Then the total sum of squares may be written in the following simple form 

$$\frac{n_1 n_0}{N} = Npq \tag{15}$$


By using this procedure we can readily see 
how important a rôle is played by the value 
of ordinary mean score in the estimate of 
variability. Consequently, if we find some 
changes in standard deviations from grade to 
grade, we should use this technique to exam- 
ine whether these changes are actually the 
results of true psychological situations or 
merely a function of the test materials. 
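
To make the dependence of the total sum of squares on the mean score concrete, the following Python sketch computes the sums of squares between items and between individuals for a small matrix of 1/0 item scores and checks that the total reduces to $Npq$. The function name `hoyt_sums_of_squares` and the score matrix are hypothetical; the formulas assumed are the usual correction-term forms (sum of squared totals divided by the number of entries per total, minus the squared grand total over $N$).

```python
def hoyt_sums_of_squares(X):
    """X[s][i] is the 1/0 score of individual i on item s.
    Returns the between-items, between-individuals, total and
    residual sums of squares."""
    k, n = len(X), len(X[0])
    N = k * n
    n1 = sum(sum(row) for row in X)      # number of correct responses
    correction = n1 ** 2 / N
    between_items = sum(sum(row) ** 2 for row in X) / n - correction
    person_totals = [sum(X[s][i] for s in range(k)) for i in range(n)]
    between_individuals = sum(t ** 2 for t in person_totals) / k - correction
    p = n1 / N
    total = N * p * (1 - p)              # simple form N p q
    residual = total - between_items - between_individuals
    return between_items, between_individuals, total, residual


# Hypothetical scores: 3 items (rows) answered by 4 individuals (columns).
X = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
items_ss, persons_ss, total_ss, residual_ss = hoyt_sums_of_squares(X)
print(items_ss, round(persons_ss, 4), total_ss, round(residual_ss, 4))
```

Because $Npq$ is largest when $p = 0.5$, a test that is very easy or very hard for a grade restricts the total sum of squares available, whatever the true differences between pupils may be.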


¹ It is clear that 

$$\bar{X}_{..} = \frac{n_1}{nk} = \frac{M}{k}$$

where M is defined as the ordinary mean score for this group. So the above statement is true. 


Fisher’s Test of the Significance of the Difference Between Two Correlations 

Fisher(4, pp. 190-196) has developed a method for testing the significance of the difference between two correlations by using the technique of the logarithmic transformation. One can easily employ this procedure by referring to Fisher’s book. By using this method in our problem, we can test the heterogeneity of the obtained inter-correlations from grade to grade. 
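
The logarithmic transformation can be sketched as follows. This is a minimal illustration under stated assumptions rather than a transcription of Fisher’s text: the function name `fisher_z_difference` is hypothetical, the transformation is taken as $z = \tfrac{1}{2}\log_e\frac{1+r}{1-r}$ with standard error $\sqrt{1/(n_1-3) + 1/(n_2-3)}$ for the difference, and the correlations and sample sizes in the usage lines are invented.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Normal deviate and two-sided probability for the difference
    between two independent correlations via Fisher's z."""
    z1 = math.atanh(r1)  # equals 0.5 * log((1 + r) / (1 - r))
    z2 = math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    deviate = (z1 - z2) / se
    # Two-sided tail probability of the unit normal distribution.
    prob = math.erfc(abs(deviate) / math.sqrt(2))
    return deviate, prob


# Hypothetical illustration: r = .50 and r = .30, each from 103 cases.
deviate, prob = fisher_z_difference(0.50, 103, 0.30, 103)
print(round(deviate, 3), round(prob, 3))
```

A deviate well inside the range of about two standard errors, as here, would be read as no significant difference between the two correlations.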


Analysis of Covariance 

The method of covariance is an extension of that of analysis of variance. Instead of breaking up the sum of squares, we break up, in the analysis of covariance, the sum of products into parts ascribable to different components. If we have two functions, for instance, we make a separate analysis of the variance of each; and if the two functions are related, we may simultaneously make an analysis of covariance. The details involved in this method need not be discussed here as they are dealt with in books relating to advanced statistics (4, pp. 264-278; 7, pp. 67-96; 16, pp. 150-157; 18, pp. 249-273). And in our problem, we can use this technique if necessary. 


III. SPECIAL EXAMINATION OF PREVIOUS STUDIES 

In this section, we wish to examine thoroughly some extensive tests which have been used by McNemar and Terman(10) as foundations of their study on sex differences in variability. It must be kept in mind, however, that: (1) we are interested in variability and age, while they were interested in variability and sex; and (2) we use the techniques discussed in Chapter II, while they used the calculation of the ratio of S.D. to . . . . For convenience, we classify the data into two categories: psychological and scholastic, as shown below. 

The letters N, M and s’, which appear in the tables in this chapter, denote the number of individuals, the mean score and the unbiased estimate of the standard deviation of a certain group, respectively. 


Psychological Data 

1. Intelligence 

a. Thorndike’s CAVD Data. In Table IV are found Thorndike’s data(21; 22) for his CAVD intelligence tests of school children of ages 13 to 17 attending certain grades. By using the $L_1$ test and Bartlett’s procedure, we find that variability in CAVD scores is constant from age to age for each sex. 


TABLE IV* 
THORNDIKE’S CAVD DATA FOR CHILDREN IN GRADES 8, 9 AND 10, CLASSIFIED ACCORDING TO AGE 

Male: M.  Female: N, M. 

June 1924, City 2 
June 1922, City 2 
42.84    48.65    46.95    44.68 
42.84    44.98    44.11    43.51 

June 1922, City 1 
42.71    44.87    45.44    46.04    46.89 
41.15    43.72    43.68    46.81    46.46 

* Reproduced from McNemar and Terman (10, p. 31), but the computation of s’ was made by the writer. 


b. Data from Whitmire. The data of Whitmire(26) are based on the National Intelligence Test scores of all the public school pupils in a small city. Her results are presented in Table V. 

TABLE V* 
N.I.T. OF UNSELECTED CHILDREN IN VALLEJO, CALIFORNIA 
(Data From Whitmire) 

Male M     Female M     Female s’ 
62.7       76.3 
107.8      120.0 
148.5      158.1 
171.9      190.0 
208.8      216.0 
221.6      244.4        47.0 

* Reproduced from McNemar and Terman (10, p. 32), but the computation of s’ was made by the writer. 


By using the same method, we find that there is no significant change of variability in intelligence from age to age for each sex. 

c. Data from Pressey. The intelligence test results of Pressey(14) are presented in Table VI. 


TABLE VI* 
INTELLIGENCE TEST RESULTS WITH UNSELECTED SCHOOL CHILDREN 
(Data From Pressey) 

Male               Female 
N      M           N      M         s’ 
57     58.20       92     64.72 
132    70.72       153    74.04     23.44 
176    79.12       177    85.86     25.51 
179    94.90       165    100.32    27.96 
182    106.34      180    110.97    25.94 
174    116.60      174    124.11    23.97 
138    121.88      163    131.02    22.11 
102    131.07      139    139.51    19.30 

* Reproduced from McNemar and Terman (10, p. 32), but the computation of s’ was made by the writer. 


We find, through the same method, that: (1) for boys, variability in intelligence has no significant change from age to age; and (2) for girls, although variability in intelligence changes significantly from age to age, there is no definite evidence of a consistent relationship between age and variability. 

From all the above studies on intelligence, the deduction may be reached that there cannot be found any evidence of an increase or decrease of variability in intelligence with age. 

2. Learning Capacity 

Pyle’s data(15) on learning to sort cards are given in Table VII. 


TABLE VII* 
PYLE’S DATA ON LEARNING CAPACITY AS MEASURED BY CARD SORTING 

Male N     Female N     Female M 
47         66 
120        139 
130        162 
142        182 
162        191 
172        175 
184        188 
143        178 
121        148 
72         127          201. 

* Reproduced from McNemar and Terman (10, p. 39), but the computation of s’ was made by the writer. 
Following the method used before, we find 
that: (1) for boys, although there are sig- 
nificant differences in variability of learning 
capacity between ages, no real relationship 
can be found between age and variability; 
and (2) for girls, variability in learning 
capacity is constant from age to age. 


Scholastic Data 


The Stanford Achievement Test was given 
by Baldwin(1) to school children of certain 
age levels for three successive years. His re- 
sults are presented in Table VIII. 

TABLE VIII* 
REPORTED TESTS WITH STANFORD ACHIEVEMENT TEST BATTERY 
(Data From Baldwin) 

Male                       Female 
N      M       s’          N      s’ 

Tests in 1923 
100            6.78        115 
117            10.24       126 
96             11.70       87 

Same Groups Tested in 1924 
100            10.18       115    10.76 
117            11.63       126    10.52 
96             12.95       87     11.04 

Same Groups Tested in 1925 
100    41.0    11.98       115    10.75 
117    49.0    12.23       126    11.37 
96     54.4    13.35       87     12.85 

* Reproduced from McNemar and Terman (10, p. 49), but the computation of s’ was made by the writer. 


To analyze these data, we wish to give two kinds of studies: (1) a comparative study of different groups of individuals for each period of testing; and (2) a follow-up study of the same groups of individuals from 1923 to 1925. These two kinds of studies are shown as follows. 


1. Comparative Study 

First we examine the results for 1923. Following the method used before, we conclude that for the data of 1923, variability in scholastic standing significantly changes from age 8 to age 10. 

Next we analyze the results for 1924. Using the same method, we find that there is no significant difference in variability of scholastic function for different ages. 

Finally, we examine the data for 1925. Proceeding as before, the same conclusions can be reached as for the data of 1924. 


2. Follow-up Study 

First we combine the results of the group of 8-year-old children in 1923 for three successive years. By using the $L_1$ test and Bartlett’s procedure, we find for this group of children from age 8 to age 10 that variability in scholastic function increases with age. 

Next we analyze the results of 9-year-old children in 1923 for three successive years. Using the same method, we find that variability in scholastic ability is constant from age to age. 

Finally, we analyze the data of 10-year-old children in 1923 for three successive years. Proceeding as before, we find that there is no significant difference in variability of scholastic function for different ages. 

From all the above studies, it may be noticed that we can find significant differences in variability of scholastic ability only when the subjects consist of 8-year-old children. This may be due to the fact that the Stanford Achievement Test might be too difficult for 8-year-old children, so that the individual differences among them cannot be disclosed as well as for the children of other age levels. Alternatively, there might be involved in these subjects a selective factor. Since the detailed data necessary for a study of this point are not available, we cannot make a further analysis here. Moreover, we find four out of six cases which show constancy of variability in scholastic ability for each sex. So 
it is safe for us to conclude, at least, that there is no consistent proof to show that variability in achievement increases with age. 


IV. EXPERIMENTAL STUDY 

We asked the teachers of two Canadian public schools, denoted by A and B in the following tables, to give tests in grades 5-8. The materials used were the National Intelligence Test, Scale A, Form 1; Schorling-Clark-Potter Arithmetic Test, Form A; and Gates Reading Survey, Form 1. The results of the analysis of these tests will be given and discussed in the following sections. 


Analysis of the Data Relating to Intelligence 

1. Chronological Age 

The obtained data regarding chronological age are presented in Table IX. 


TABLE IX* 
OBTAINED DATA FOR CHRONOLOGICAL AGE 

          School A            School B 
Grade     M        s’         N     M        s’ 
5         35.27    10.64      26    33.92    12.38 
6         46.03    11.17      27    47.74    14.08 
7         59.66    13.25      34    58.41    12.48 
8         67.20    9.79       32    67.22    11.14 

* In this table and Tables X, XI, XII, XIII, XIV, XV, the letters N, M and s’ denote the number of individuals, the mean score and the unbiased estimate of the standard deviation of a certain group, respectively. Again, in this table and the next, all the mean scores of chronological age and mental age are in terms of months; and for convenience in calculation, 100 has been subtracted from all the scores. 


By using the $L_1$ test and Bartlett’s procedure, we find that there is no significant change in variability of chronological age for both schools from grade to grade or for each grade from school to school. Moreover, in making a complete analysis of variance for these data, we find that: (1) there is no significant difference in chronological age between schools, so it might be assumed that the selective factor in the two schools, as far as chronological age is concerned, is very slight and does not affect our results; and (2) there are significant differences in chronological age between grades. However, if we examine the data in Table IX, we find that the mean chronological age for grade 5 of each school is around 11 years; for grade 6, 12 years; for grade 7, 13 years; and for grade 8, 14 years.² It is clear also that there is no significant selective factor for chronological age in the different grades, because the mean chronological age is constantly increased by approximately one year from grade to grade. 

² Note that in Table IX, the mean age in terms of months has been decreased by 100. So if we wish to change it into actual age in terms of years, it must be first increased by 100 and then divided by 12. 


2. Mental Age 

The data for mental age are presented in Table X. 

By using the same methods, we find that: 

(1) for school A, although variability in mental age appears to increase from grade to grade, the differences are not significant; 

(2) for school B, variability in mental age is constant from grade to grade; 

(3) as far as mental age is concerned, there is a significant selective factor involved in the two schools. This, however, does not affect the above statements; 

(4) there are significant differences in mental age between grades, as is to be expected. This raises the point, however, of whether or not there is an important selective factor involved in these grades. This question will be answered by analyzing the data relating to IQ scores. This analysis is presented in the next section. 

TABLE X 
OBTAINED DATA FOR MENTAL AGE 

School A              School B 
M        s’           N 
49.78    10.82        26 
61.36    14.79        27 
67.88    16.38        34 
78.28    17.03        32 


3. The IQ Scores³ 

The obtained data for the IQ scores are shown in Table XI. 


TABLE XI 
OBTAINED DATA FOR INTELLIGENCE QUOTIENT 

School A      School B 
M             N     M         s’ 
111.51        26    98.00     13.45 
111.28        27    98.85     19.08 
106.16        34    101.21    16.55 
107.50        32    105.63    14.92 


Proceeding as before, we find that: 

(1) For both schools variability in intelligence quotient is constant from grade to grade. This, however, bears no material meaning in reference to the footnotes of this section. 

(2) As far as the IQ scores are concerned, the selective factor is significantly involved in the two schools, but not in the different grades. So the constancy of variability in mental age with grade actually represents a true psychological situation. 

³ If 

I = intelligence quotient 
M = mental age 
X = chronological age 

the formula for the standard deviation of IQ will be as follows(8): 

$$\sigma_I^2 = I^2\left(\frac{\sigma_M^2}{M^2} - 2 r_{XM}\frac{\sigma_X \sigma_M}{XM} + \frac{\sigma_X^2}{X^2}\right)$$

We have already found, from the last two sections, that both $\sigma_X$ and $\sigma_M$ are constant from grade to grade. Let us assume that, within a certain group, $r_{XM} = 0$. It follows that $\sigma_I$ should decrease with age or grade. But in the present case, we still find that $\sigma_I$ is constant from grade to grade. This may be due to the negative value of $r_{XM}$ within each grade, as shown later (see Table XVII). It will also be noticed that the IQ scores are used here . . . selective factor involved between schools or between grades. The constancy of variability in IQ in this case . . . have no significant meaning with regard to . . . because it can be explained from the negative value of $r_{XM}$. The constancy of variability in mental age, which is obtained in the last section, is easier to interpret and probably denotes a true psychological situation. 


Analysis of the Data for Arithmetic Ability 

The data for arithmetic scores of Schorling-Clark-Potter Arithmetic Test are presented in Table XII. 

By using the $L_1$ test and Bartlett’s procedure, we find that variability in arithmetic scores appears to increase with grade for both schools. In further analysis with Hoyt’s procedure, however, this result is found to be merely a function of the test material itself and does not represent any true psychological situation. 


TABLE XII 
OBTAINED DATA FOR ARITHMETIC ABILITY 

School A      School B 
M             N     M        s’ 
11.49         26    9.35     3.78 
29.74         27    21.96    10.84 
49.88         34    40.29    16.52 
61.94         32    43.28    10.37 


Analysis of the Data for Language Ability 
1. Vocabulary 
The obtained data for the vocabulary 


scores of Gates Reading Survey are presented 
in Table XII1. 


By using the same methods, we find that: 

(1) For each school variability in vocabu- 
lary is constant from grade to grade. 

(2) In both ability and variability in 
vocabulary, there is a significant selective 
factor involved in the two schools. Neverthe- 
less, statement (1) is still true. 


2. Comprehension Score 

The results for the comprehension scores 
of Gates Reading Survey are summarized in 
Table XIV. 

Proceeding as before, we find that: 

(1) For school A, variability in compre- 
hension score is constant from grade to grade. 
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TABLE XIII 
OBTAINED DATA FOR VOCABULARY SCORE 


School A 
M 


48.95 
56. 41 
59. 44 
65. 78 


N 
26 


27 
34 
82 


TABLE XIV 
OBTAINED DATA FOR COMPREHENSION SCORE 


School A 
M 


60. 85 
67.10 


73. 36 


(2) For school B, it appears that there 
are differences in variability in comprehen- 
sion score between grades. The differences, 
however, are not significant. 

(3) In both ability and variability in com- 
prehension, there is a significant selective 
factor involved in the two schools. Once 
more, this phenomenon cannot alter state- 
ments (1) and (2). 


3. Speed Score 

The data for the speed score of Gates 
Reading Survey are shown in Table XV. 

Proceeding as in the earlier cases, we find 
that: 

(1) Variability in speed is constant from 


grade to grade. 


8 
es 
67. 68 8. 
6. 


14. 00 
13.39 
11.75 
7.81 


School B 
M 


44.04 
53. 89 
57.79 
67. 03 


(2) There is a significant selective factor 
in speed involved in the two schools. This 
fact does not affect the finding given in state- 
ment (1). 


Inter-Correlations Between Different 
Functions 
1. Correlations Between 1Q and Other Scores 
for Different Grades 
In Table XVI, we find the correlations 
between IQ and other scores for each grade. 
It is noticed that we have obtained estimates 
of correlations between IQ scores and chrono- 
logical age, which are approximately equal to 
those obtained by Jackson and Ferguson(9, 
p. 123) in their studies of the reliability of 
tests. These estimates, however, cannot rep- 


TABLE XV 
OBTAINED DATA FOR SPEED SCORE 


School A 
Grade M 
5 49. 68 
6 41. 08 
7 42.41 
8 53.25 


26 
27 
34 
32 


TABLE XVI* 
CORRELATIONS BETWEEN IQ AND THE OTHER SCORES 


School A 
N Tix Tim Tia Tw ne 
41 —.78 .75 .12 .44 .650 
39 —.70 .82 .02 .79 «.7i1 
82 —.76 .86 .49 .63 .7i1 
86 —.86 .94 .54 .59 «71 


Grade 
5 
6 
7 


-34 26 ’ 
-7% 27 . 99 .67 .88 
-50 34 . -89 .80 .79 
-61 382 88° .92 .77 .68 


School B 
‘is Tim Tia Tw 


-80 .33 .56 


* In this table and all the tables which come later, X denotes chronological age; M, mental age; I, in- 


telligence oat 


in Gates Reading Survey; C, raw score on comprehension in Gates 
in Gates Reading Survey; N, the number of individuals; and r, the correlation 


A, Taw score on Schorling-Clark-Potter Arithmetic Test; V, raw score a vocab’ 


; S, raw score on speed 
cient. 
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resent any true psychological situation. Early 
in 1897, Karl Pearson(13) pointed out that 
the correlations calculated from indices may 
result in a statistical fallacy. Several other 
research workers (20; 27, pp. 300-301; 8; 
23) have studied this same problem. Since 
| 1Q scores are indices, the correlations listed 
in the above table are affected by the rela- 
tionship between chronological and mental 
ages and other scores, and hence their inter- 
pretation is difficult if not impossible. Of 
course, IQ scores are useful for the purpose 
of showing the relative positions (with re- 
spect to mental development) of children at 
different age levels. Once the scores are re- 
lated to other scores, however, they may fail 
to show the true psychological situation. So 
the above table deserves no further analysis. 
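
Pearson's point about indices can be illustrated with a small simulation. The sketch below is not part of the original study: the ages are hypothetical, drawn independently of one another, and the helper name pearson_r is introduced only for illustration. Even though mental age and chronological age are unrelated by construction, the ratio index correlates negatively with its own denominator.

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical ages in months, drawn independently of each other.
chron_age = [rng.uniform(120, 168) for _ in range(n)]
mental_age = [rng.uniform(120, 168) for _ in range(n)]

# The index: IQ = 100 * mental age / chronological age.
iq = [100 * m / x for m, x in zip(mental_age, chron_age)]

r_mx = pearson_r(mental_age, chron_age)  # close to zero by construction
r_ix = pearson_r(iq, chron_age)          # clearly negative

print(f"r_mx = {r_mx:+.2f}, r_ix = {r_ix:+.2f}")
```

The negative value of r_ix arises only because chronological age appears in the denominator of the index, which is the kind of statistical artifact the paragraph above warns against.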
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doing so, we find that there is a certain selec- 
tive factor involved within each grade; but 
not between grades. Since we are only inter- 
ested in comparing grades, the above findings 
dealing with variability and grade are, of 
course, unaffected. 


3. Correlations Between Mental Age and 
Different Scholastic Functions for Dif- 
ferent Grades 


Table XVIII shows the different correla- 
tions between mental and educational scores 
for different grades. From the data of this 
table, we conclude that all the correlations 
obtained are positive but not perfect. This 
agrees with the results of the writer’s previous 
study(23). By using Fisher’s technique of 
testing the significance of the difference be- 


TABLE XVII 
CORRELATIONS BETWEEN CHRONOLOGICAL AGE AND THE OTHER SCORES 


School A 
xm Txa Txv 


—.18 


2. Correlations Between Chronological Age 
and Other Scores for Different Grades 


In order to analyze what is the relationship 
between age and mental or scholastic stand- 
ing in each grade, we present all the correla- 
tions in Table XVII. 

If a tested sample is unselected, then the 
relationship between chronological age and 
mental or scholastic scores will be positive 
and probably high. From this table, however, 
there is evidently a certain selective factor in- 
volved within each grade because most of the 
correlations are negative. It is necessary, 
therefore, to study these data by using the 
method of analysis of covariance in order to 
determine whether or not there is another 
selective factor involved between grades. In 


Txc xs 
.07 —.28 —.24 —.22 
—.17 .02 —.34 —.40 —.40 
—.33 —.34 —.30 —.43 —.17 
—.66 —.36 —.34 —.54 —.38 


School B
r_xm r_xa r_xv r_xc r_xs
-.06 .06 -.18 -.24 -.33
-.65 -.56 -.42 -.53 -.37
-.34 -.50 -.32 -.43 -.38
-.55 -.62 -.41 -.57 -.36


tween two correlations, we test the hypothe- 
sis that the highest correlation equals the 
lowest one from grade to grade. In doing so, 
we find that: (1) although there are signifi- 
cant differences between the correlations of 
mental age and vocabulary for different 
grades of school A, no definite trend can be 
found to show an increase or decrease in the 
estimate of relationship from grade to grade; 
and (2) all the other estimates of relationship 
between mental age and achievement are con- 
stant from grade to grade. 
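
The comparison of a highest and a lowest correlation can be sketched as follows, assuming that the technique meant is Fisher's z transformation for two independent samples. The function name and the input values are illustrative only and are not results of the study.

```python
import math

def fisher_z_difference(r1, n1, r2, n2):
    """Test H0: rho1 == rho2 for two independent samples.

    Uses Fisher's z = atanh(r), whose sampling variance is
    approximately 1 / (n - 3). Returns (z, two_sided_p).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p

# Illustrative values only: a high and a low correlation from small groups.
z, p = fisher_z_difference(0.78, 32, 0.08, 26)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With groups of about thirty pupils, only a large gap between the two coefficients produces a significant z, which is consistent with most of the estimates being judged constant from grade to grade.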
4. Correlations Between Arithmetic and Lan- 
guage Abilities for Different Grades 

All the obtained correlations between 
arithmetic and different language abilities for 
each grade of each school are summarized in


TABLE XVIII 
CORRELATIONS BETWEEN MENTAL AGE AND SCHOLASTIC SCORES 


School A 
Tmv Tuc 
. 40 . 56 
.19 . 80 . 65 
. 40 . 67 .70 
. 58 . 66 . 73 


School B 
N Tmv 
26 : . 59 
27 ; . 83 
34 4 . 85 
32 : .73 
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TABLE XIX 
CORRELATIONS BETWEEN ARITHMETIC AND LANGUAGE SCORES 


School A 
Grade N Tac 
5 41 . . 33 
6 39 ‘ . 08 
7 32 ; . 35 
8 36 : . 34 


Table XIX. By using the same methods, we 


find that: 


(1) The relationship between arithmetic 
and language abilities is positive but far from 
perfect. 

(2) It appears that the relationship be- 
tween arithmetic and vocabulary scores 
increases with grade for school B. This, how- 
ever, is due to the very small size of variabil- 
ity in arithmetic scores for grade 5. Since 
variability in arithmetic scores changes sig- 
nificantly from grade to grade, as shown in 
Table XII, the differences in the correlations
bear no meaning in relation to any true
psychological situation.

(3) All the other estimates of correlations 
between arithmetic and language abilities are 
constant from grade to grade. 
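
The effect described in finding (2), a correlation depressed by small variability in one of the scores, can be shown with simulated data. This is a sketch under stated assumptions: the scores are hypothetical standard scores, the same linear relation holds for every simulated pupil, and the names arith, vocab and pearson_r are introduced only for illustration.

```python
import random
import statistics

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical standard scores; vocabulary depends on arithmetic in the
# same way for every simulated pupil.
arith = [rng.gauss(0, 1) for _ in range(n)]
vocab = [a + rng.gauss(0, 1) for a in arith]

r_full = pearson_r(arith, vocab)

# Keep only a narrow band of arithmetic scores, imitating a grade in
# which the variability of arithmetic scores is very small.
narrow = [(a, v) for a, v in zip(arith, vocab) if abs(a) < 0.5]
r_narrow = pearson_r([a for a, _ in narrow], [v for _, v in narrow])

print(f"full range r = {r_full:.2f}, narrow range r = {r_narrow:.2f}")
```

The underlying relation is identical in both computations; only the spread of the arithmetic scores differs, so a lower coefficient in the narrow group says nothing about a change in the relationship itself.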


5. Inter-Correlations Between Language 
Scores for Different Grades 
The inter-correlations between language 
scores for different grades of each school are 
summarized in Table XX: 


School B 


Tav Tac Tas 
. 08 . 34 - 25 
. 59 . 60 . 62 
.70 . 64 . 69 
.78 . 64 - 23 


2. The relationships among the mental and 
scholastic functions considered were positive 
but not perfect. 


3. The relationships among the mental and
scholastic functions considered were constant
from grade to grade.


Educational Implications 

Here the writer gives the following two 
simple educational implications of the above 
findings: 

(1) Since both variability in a mental or 
scholastic function and the relationship be- 
tween any two mental or scholastic functions 
were constant from grade to grade, we should
emphasize individual differences of school 
children as early as possible. Individual dif- 
ferences in learning capacity and achievement 
in a subject are revealed even at the earlier
age levels or in the lower grades. Therefore, 
the fixed curriculum, based upon the provi- 
sion for a normal child, should be readjusted 
by a skillful teacher to meet the situation of 


TABLE XX 
INTER-CORRELATIONS BETWEEN LANGUAGE SCORES 


School A 
lve Ivs 
. 78 . 67 
. 76 .74 
.78 . 63 
. 68 .39 


Proceeding as before, we find that: (1) the
relationship between the different language 
scores is positive but far from perfect; and 
(2) all the inter-correlations obtained are 
constant from grade to grade. 


V. CONCLUSIONS AND EDUCATIONAL
IMPLICATIONS
Conclusions 


1. Variability in a mental or scholastic 
function was constant from grade to grade 
or from age to age. 


School B 


cs N yc Tvs 


. 63 26 . 64 - 70 
- 55 27 . 81 . 85 
. 69 34 -73 . 79 
Ay | 32 -81 . 68 


individual differences even in the lower
grades.

(2) Under the present educational system, 
a child learns arithmetic and language in the 
same grade. The relationship between any 
two mental or scholastic functions, however,
will almost certainly not be perfect. Hence 
it is not advisable to gather many children in 
a certain grade, in which child A may be 
excellent in every subject; child B, poor in 
every subject; child C, excellent in one subject
and poor in another; and so on. The
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writer suggests that the standing of a child 
may be at the level of grade 2 for arithmetic, 
grade 4 for language and so on, instead of 
merely at the level of grade 3 as a whole. 
Thus all the children who have approxi- 
mately the same standing in a certain subject 
should be trained in this same subject in the 
same class, and the process of teaching and 
learning will be rendered more efficient.
THE DEVELOPMENT AND APPLICATION OF AGE
PROGRESS PERCENTILE NORMS OF ELEMENTARY 
SCHOOL ACHIEVEMENT 


ETHEL L. CORNELL 
Division of Research, New York State Education Department 


Part I 


MEASURING INDIVIDUAL PUPIL PROGRESS AT 
DIFFERENT ACHIEVEMENT LEVELS BY 
AGE PROGRESS PERCENTILE NORMS


Origin of study.—Problems of individual 
learning have assumed somewhat greater 
prominence within the last decade than they 
previously had, and in the testing field greater 
attention is being given to developing better 
standard reference points for the evaluation 
of individuals than the traditional grade 
norms provide. 

In 1934 an opportunity was presented of 
studying the records of pupils in a small com- 
munity who had been tested annually by 
Stanford Achievement Tests for five years. 
The major objective of the study was to de- 
vise a method by which the progress of pupils 
as measured by tests might be used in evalu- 
ating the effectiveness of the educational pro- 
gram. The testing program in this commu- 
nity had been inaugurated at the same time 
that the educational program in the elemen- 
tary school had been radically modified in 
the direction of informality. Since the em- 
phasis of the new program was upon the indi- 
vidual pupil, it seemed that if an evaluation 
of the program in terms of measurement 
should be possible, it would come about 
through the evaluation of individual pupil 
growth. 

Need of individualized progress norms.—To 
measure individuals’ growth adequately, the 
use of grade norms seemed inadequate. The 
author’s belief was at that time, and still is, 
that a school grade, in a modern school, is an 
equivocal reference point to which to refer all 
questions of standards. If a child is not at 
age for grade, should his educational status 
be measured with reference to the school 
grade in which he is, or the one in which he 
should be if “rightly” placed for his age? For 
twenty years school testing programs have 
been accumulating evidence that: 


1. Even when there is considerable retarda- 
tion and acceleration presumably to “adjust” 
pupils to their school work, it is found that 
pupils are not placed in grades in accordance 
with any rigid criterion that their school work 
is at that grade level. All school surveys have 
insistently shown that in any grade some 
pupils may be found whose accomplishment 
is two to four grades below the grade level 
and other pupils whose accomplishment is 
two to four grades above. 


2. The growing tendency of schools to de- 
crease the amount of age grade retardation 
does not mean that “retarded” pupils reach 
the grade standards appropriate for their age; 
it means that the spread of achievement with- 
in a grade becomes greater, and the meaning 
of a grade departs further and further from 
that of a single standard of difficulty and 
approaches more and more closely that of an 
age standard with approximately the same 
variability found within a chronological age 
group. 

Chronological age as a reference point.— 
People still cling to their traditional concepts 
even as new facts emerge. Nothing is more 
difficult for the educator to conceive, appar- 
ently, than the notion that school achieve- 
ment might be better measured with reference 
to the range of achievement of an age group 
than with reference to a grade, even when the 
grade is known to be a continuously shifting 
datum. The hurdle that cannot be jumped 
seems to be the apparent injustice of compar- 
ing two children of the same age in different 
grades. We cannot expect, so the argument 
runs, that a ten year old in the fourth grade 
will know the work of the fifth grade before 
he has been taught it, and he, therefore, 
should not be compared with a ten year old 
in the fifth grade. But the argument does not 
stand the test of fact. We repeatedly find 
ten year olds in the fourth grade whose tested 
achievement is higher than fifth grade or even 
sixth grade expectation. Children learn many 
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things that they are not specifically taught. 
We cannot, of course, say whether for any 
given child his achievement would or would 
not be higher if he had been placed in a 
higher grade, but we do know that it is in 
some cases much higher—three or four 
grades higher—than the “normal” expecta- 
tion for his grade. 


gram in the elementary school was made. The 
ages of the children at the time of their first 
test varied from 7 years to 16 years and the 
grade location from 2 to 8. Stanford Achievement
Tests were used throughout; in the first
year, the form used was the old Stanford and,
thereafter, the revised “New Stanford” was
used.


TABLE I


AGE AND GRADE PLACEMENT OF PUPILS WHO HAD THREE ANNUAL TESTS IN
SEPTEMBER OF YEAR OF FIRST TEST


10 


Number 
11 of cases 





7 
27 
22 
11 

4 





Number of cases. 35 71 


Importance of including range of variability
in concept of standards for measuring
pupil progress.—A few test makers have recognized
this problem and tried to create
meaningful grade standards by using for 
standardization only pupils at age for grade. 
This is a great advance so far as stabilizing 
the meaning of a “grade norm” goes, but it 
does not afford any criterion of performance 
for the pupil who is not placed at age for
grade. The exceptional pupils are omitted not 
only from the establishment of norms but 
also from the determination of variability. 
But as a criterion for appraising the progress 
of individual pupils who are not “at age for 
grade,” the variability of an unselected group 
is exactly what we need to have. Selection of 
those at age for grade may eliminate from 
the standardization data from a quarter to a 
half of the total number of children. 
Purpose of analysis and limiting factors in
data.—To get a reference point by which we 
might more adequately measure progress of 
pupils who are not close to the average, the 
data from the community above referred to 
were analyzed by chronological age. All the 
pupils considered were tested at least three 
times at yearly intervals and many of them 
had four or five tests. It was, therefore, pos- 
sible to plot progress for the same individuals 
for at least two or three years. 

Several factors about the local situation 
must be stated in order to interpret the test 
findings. 

1. The testing began in the year that the 
change from a traditional to an informal pro- 


2. At the beginning of this period, there 
was a considerable range of chronological age 
within each grade, as a result of previous 
promotion policies. During the period, a 
policy of regular promotion was maintained, 
with few exceptions, but as no radical shifts 
in the grade status of individuals were made, 
the spread of ages remained about as it was 
under the traditional program. The grade 
placement of pupils of each age group is 
shown in Table I, as it was in September of 
the year in which they had their first test. 
(This is not an age grade distribution for any 





Chart 1. Explanation. Each line on the chart 
represents a given percentile level for the same group 
of children tested at three successive ages. Each 
group is composed of children within a range of one 
year in chronological age, and each line on the chart, 
therefore, indicates the progress made by children 
of a given age at a given percentile level for a period 
of two years. Different types of lines are used merely 
for convenience in reading the chart. For example, 
the topmost line on the chart, labeled P90 and beginning
at a chronological age of 12-2 and an educational
age of 14-11, represents the progress made at
the 90th percentile of a group with a median age of 
12-2 (range 11-9 to 12-8) at the time of their first 
test and a median age of 14-2 at the time of the 
third test. The 75th, 50th, 25th, and 10th percentile 
progress lines of this same group of children will be 
found beginning at successively lower points in the 
same vertical line. Similar progress lines have been 
drawn for groups first tested at different chronological
ages and followed for two years. Since testing
did not extend above the eighth grade, the group 
that was first tested at age 13-3 and followed to age 
15-3 represents the retarded portion of this total age 
group. It will be noticed that the progress lines at 
each percentile level for this group above P10 are 
quite out of line with the other groups. 
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Chart 1.—Progress in educational age for two years at 5 percentile-levels of pupils having 
their first test at successive ages. 
single year, since the first test might have
been in 1928-29, 1929-30, or 1930-31.) 

3. The age progress norms which have been 
developed from these data are probably not 
exactly typical, for two reasons. In the first 
place, the average I.Q. of the school was less 
than 100—about 94. In the second place, the
progress made by pupils was influenced not 
only by the new program but also by the 
previous program. The effect of the old pro- 
gram was certainly greater on pupils more of 
whose schooling had been under the old pro- 
gram—that is, on the pupils who were older 
at the beginning of the testing period. This 
is a complicating factor which is impossible 
to isolate but which may make a comparison 
of the progress of older and younger children 
less valid than is desirable. 

Nevertheless, although this situation is 
perhaps not typical, it can be used to illus- 
trate the development of a technique which 
is applicable for evaluation within the com- 
munity and which has promise of being valu- 
able if developed on a larger scale. Part I 
will present the method and the age progress 
percentile norms derived from it, with some 
illustrations of their use for the guidance of 
pupils. Part II will deal with further appli- 
cation of the technique to other problems of 
evaluation. 


Technique used to develop progress norms. 
—The method was simple enough for a per- 
son with any training in educational meas- 
urements to use for himself. However, it re- 
quired a great deal of clerical spade work, 
since the individual cumulative folders of 
about 1500 children had to be examined to 
get the records of successive tests and other 
pupil data with which the analysis was con- 
cerned. These records yielded slightly over 
600 cases with usable data, of whom approxi- 
mately 460 pupils had had at least three 
tests at yearly intervals. 

1. The first step was to classify the pupils 
according to their chronological age. An age 
group was regarded as comprising those chil- 
dren who had reached a given birthday on 
September 1 of the school year in which the 
first test was given. The median chronolog- 
ical age of an age group in September would 
normally be x years and 6 months, and the 
median age in June (when the testing was 
done) 9 months more. Certain of the age 

1The work was done with clerical assistance furnished by 
groups varied slightly from expectation as indicated
below, but the total range of each
group was exactly 12 months, except the 6
year group in which the total range was approximately
6 months. (The reason for this
is that children in Grade 1 were not tested
and 6 year olds did not reach Grade 2 before
the middle of their sixth year.) The medians
of the chronological age groups in September
and June were as follows:


September 6-10 7-6 8-5 9-5 10-6 11-5 124 
June 7-7 8-3 9-2 10-2 11-3 12-2 134 
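
The relation between the two rows (June median equals September median plus 9 months) can be checked with a few lines of arithmetic on the "years-months" notation. The helper names below are introduced only for illustration, and only the six columns that are clearly legible are used.

```python
def to_months(age):
    """Convert an age written as 'years-months' (e.g. '10-6') to months."""
    years, months = age.split("-")
    return 12 * int(years) + int(months)

def to_age(months):
    """Convert a count of months back to 'years-months' notation."""
    return f"{months // 12}-{months % 12}"

# September medians as given in the text (legible columns only).
september = ["6-10", "7-6", "8-5", "9-5", "10-6", "11-5"]

# Testing was done in June, 9 months later.
june = [to_age(to_months(a) + 9) for a in september]
print(june)  # ['7-7', '8-3', '9-2', '10-2', '11-3', '12-2']
```

The same two helpers serve for any later computation on educational ages, which are expressed in the same notation.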


2. The second step was to make a distribution
of the educational age scores on Stanford
Achievement of each of these chronological
age groups for the first, second and third year
tests, and to determine the 90th, 75th, 50th,
25th and 10th percentiles of each distribution.
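
The second step can be sketched in code. The article does not say how the percentiles were computed (a grouped frequency method is likely), so linear interpolation between ordered scores is used here as an assumption, and the educational ages are hypothetical.

```python
def percentile(sorted_scores, p):
    """p-th percentile (0-100) by linear interpolation between ordered scores."""
    if not sorted_scores:
        raise ValueError("no scores")
    pos = (len(sorted_scores) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_scores) - 1)
    frac = pos - lo
    return sorted_scores[lo] + frac * (sorted_scores[hi] - sorted_scores[lo])

def percentile_points(scores, points=(90, 75, 50, 25, 10)):
    """Return the five reference percentiles used for the progress lines."""
    ordered = sorted(scores)
    return {f"P{p}": percentile(ordered, p) for p in points}

# Hypothetical educational ages, in months, for one chronological age group.
ea_months = [93, 98, 101, 104, 106, 108, 110, 113, 117, 122, 130]
norm_points = percentile_points(ea_months)
print(norm_points)
```

Repeating this for each age group on each of the three tests gives the points that are plotted in the third step.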

3. The third step was to plot each of the 
five percentile points for each age group on 
the three successive tests and study the 
trends. These are shown in Chart 1. A pre- 
liminary analysis of gains had indicated that 
those whose first scores were low made the 
largest gains, particularly at the younger 
ages, and we were, therefore, prepared for the 
steep slope of the P10 and P25 lines at the
younger ages, which will be observed on the
chart. A notable fact about these trends that
may be readily observed is the much greater
inconsistency of the P10 lines than of the
P75 and P90 lines. For example, at age 11-3,
the tenth percentile of the group tested first
at that age was very low—educational age of
7-9, while the tenth percentiles of the groups
having their second and third tests at the
same age (or 11-2) were much higher—educational
age of 9-5 and 9-4, respectively. It
is probable that this indicates progressively 
greater attention to the needs of slow children 
as the new program developed. 

4. The fourth step was to draw a line of 
best fit from these progress trends for each of 
the five percentile points, P90, P75, P50, 
P25 and P10. This was done by inspection,
giving somewhat less weight to extremes such 
as above mentioned and discounting to some 
extent the empirical points found for ages 13 
and above, since there was undoubtedly some 
selection at these ages. (The testing program 
did not go beyond the eighth grade; pupils 
who had three tests could not have been 
above 6th grade at the first test; age 13 is 
overage for Grade 6; some pupils of ages 13 
and 14 and most pupils of 15 are already in 
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Chart 2.—Tentative percentile progress norms derived from data of Chart 1 (See also Table VI) 


Note: The hypothetical lower limit represents an estimate of the lower limit of educational achieve- 
ment attainable by mentally defective pupils acceptable in public school (excluding those below moron 
grade). The hypothetical upper limit is very hypothetical. Some individuals are found above this ceiling 
but it is assumed that individualized provision must be made for such individuals and that this may be 
regarded as a tentative ceiling for the upper ranges of achievement to be provided for thru school groups. 
the 9th grade.) The lines of best fit are
shown in Chart 2. 

Chart 2 is to be interpreted as presenting 
age range and age progress norms (i.e., mul- 
tiple reference points) against which to 
measure a pupil’s progress, at whatever level 
of achievement his test status may place him 
at a given time. This procedure is in con- 
trast to the usual use of norms. On the basis 
of grade norms there is no way of judging 
what progress a pupil should be expected to 
make from year to year, if his status deviates 
from the grade norm. Using age range, age 
progress norms, a pupil’s progress may be 
measured in terms of the progress to be ex- 
pected at his level of achievement. 

Illustration of use for appraisal of indi- 
viduals’ progress —An illustration will make 
clear the difference in interpretation between 
the two approaches. Suppose we have three 
pupils who, at the beginning of Grade 3, 
made scores on the Stanford Achievement 
Test equivalent to a grade of 4.1, 4.1, and 
3.0, respectively. Suppose Pupil A was aged 
8-6 at the time, Pupil B, 9-6, and Pupil C, 
10-0. What criterion have we by which to 
judge whether their progress is satisfactory 
or not by the time they are 14-6 years of
age? The grade norm provided by the Stan- 
ford test at age 14-6 would be 8.5. Would 
this be a satisfactory status for all three 
pupils? Should we expect more of one than 
of another? 

Using the progress chart, we could inter- 
pret their expected progress as follows (if the 
conditions were similar to the conditions 
under which these progress norms were derived):
Pupil A, at a chronological age of 8-6,
and a grade test status of 4.1, had an educa- 
tional age of 10-0. (Educational age and 
grade equivalents are given by the Stanford 
norms.) This is equivalent to a percentile 
position for his age of P90, according to
Chart 2. If he maintained his relative posi- 
tion in his age group up to age 14-6, he 
would then have an educational age of 17-2, 
about a year beyond the tenth grade norm. 
Pupil B, at a chronological age of 9-6, with 
an educational age of 10-0, would have a 
percentile position about half way between 
P50 and P75. If he maintained the same
relative position in his age group at age 14-6, 
he would have an educational age of 14-9, or 
a grade equivalent of almost 9.0. Pupils A 
and B, beginning Grade 3 with the same degree
of grade acceleration, should not be expected
to reach the same point at age 14.
Consider Pupil C, at the opposite extreme—a
child whose grade placement is 3.0 at
age 10-0, and whose test status at that time
is grade 3.0—equivalent to an educational
age of 8-6. Should we expect this child to
reach 8th grade standards by the age of
14-6? According to Chart 2, if he maintained
his relative position, which at age 10-0
was slightly above the tenth percentile of his
age group, he would reach an educational age,
at 14-6, of 10-8, or a grade status just above
the middle of Grade 4. If we give the ten
year old an extra year and a half (since he
was one and one-half years older than Pupil
A at the beginning of Grade 3), he would
still, at age 16, have attained a status barely
equivalent to the beginning of 5th grade.
Computations may be somewhat more easily
made from Table VI, at the end of Part I.
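
The reasoning applied to Pupil B, holding a position between two percentile lines and reading off the later educational age, can be sketched as a linear interpolation. Table VI is not reproduced in this section, so the P50 and P75 values below are hypothetical: they were chosen only so that the midpoint reproduces the figures quoted above (educational age 10-0 at age 9-6, and 14-9 at age 14-6). The function names are likewise introduced for illustration.

```python
def to_months(age):
    """Convert 'years-months' notation (e.g. '10-0') to months."""
    years, months = age.split("-")
    return 12 * int(years) + int(months)

def to_age(months):
    """Convert a whole number of months back to 'years-months' notation."""
    return f"{months // 12}-{months % 12}"

def project(score_now, norms_now, norms_later):
    """Carry a pupil's position between two norm lines forward in time.

    norms_now and norms_later are (lower, upper) educational ages in months
    for the same pair of percentile lines at the earlier and later ages.
    """
    lower_now, upper_now = norms_now
    frac = (score_now - lower_now) / (upper_now - lower_now)
    lower_later, upper_later = norms_later
    return lower_later + frac * (upper_later - lower_later)

# HYPOTHETICAL (P50, P75) norm values; not taken from Table VI.
norms_at_9_6 = (to_months("9-7"), to_months("10-5"))
norms_at_14_6 = (to_months("14-0"), to_months("15-6"))

# Pupil B: educational age 10-0 at chronological age 9-6.
pupil_b_later = project(to_months("10-0"), norms_at_9_6, norms_at_14_6)
print(to_age(round(pupil_b_later)))  # 14-9
```

The expected status at age 14-6 therefore depends on where the pupil stood within his own age group, not on the grade in which he happened to be placed.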


If these progress norms are true, if children 
who are not too obviously maladjusted in 
Grade 3 will vary a few years later in the 
normal or expected course of their develop- 
ment, from a 5th grade level to an 11th grade 
level in achievement, it is obvious that a new 
kind of evaluation of what is “satisfactory” 
progress is required, which would probably 
lead to a radical reorganization of our tradi- 
tional concepts of grading and promotion. 

Some validation of these norms is afforded 
by the extent to which the progress norms at 
the median agree with other types of norms, 
This is shown in Chart 3. The median I.Q.
of all pupils was 94 and the “theoretical expectation
for I.Q. 94” is thus a better basis
for comparison than the usual “theoretical
norm.” It will be noticed that the slope of the
P50 progress norm is somewhat different
from that of the expectation for I.Q. 94. At
younger ages, the median progress norm is
higher than expectation but at age 14 and
above it coincides with the expectation for
I.Q. 94. Above 14, the progress norm is,
however, hypothetical since we had no measures
of pupils above 8th grade. Whether a
drop is inevitable at the higher ages or 
whether it is due to poorer adaptation of the 
program to pupil needs in the upper grades 
is a problem for further research. 

Extent to which grade norms are inappli- 
cable.—As an indication of how many chil- 
dren there may be whose progress or status 
makes grade norms inappropriate, we may 
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with mental age. Difference of progress norm from actual medians is due to fact that progress norms are 
based on several consecutive tests of same children.
examine the distributions of four age groups
at the time of their first test, as shown in 
Table II. Of the group of 91 who were be- 
tween 7 and 8 years old in September of the 
year of their first test, there were 12 whose 
grade placement was already ahead of the 
appropriate Grade 2. Of those 12, 7 had an 
educational age well above expectation (9-6 
or above); but, also, 11 of the 79 who were 
still in Grade 2 were equally accelerated in 
educational age. About 14 of the seven year 
olds were, therefore, well above second grade 
attainment levels. Of 102 eight year olds, 13 
were advanced in grade placement, of whom 
11 were also markedly advanced in tested
achievement (educational age 10-6 or over); 
but, also, 8 of the 37 in the appropriate Grade 
3 were equally advanced in achievement— 
making a total of approximately 19 per cent 
of the eight year olds. About 15 per cent of 
the nine year age group and about 10 per cent 
of the ten year group were proportionately 
advanced. At the other extreme, 8 of the 
seven year olds (9 per cent) had tested 
achievement well below the grade expectation 
for seven year olds (below educational age 
7-0); 23 (23 per cent) of the eight year olds,
9 (11 per cent) of the nine year olds, and 
20 (28 per cent) of the ten year olds were 
proportionately retarded. Of the total num- 
ber of 350 children in these four age groups, 
almost exactly one-third were so far above or 
below the limits of “right age for grade” 
expectation that such norms would not be 
appropriate. 
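
The "almost exactly one-third" figure can be checked from the counts and percentages in the paragraph. The sizes of the nine and ten year groups are not stated; in the sketch below they are assumed from the quoted counts and percentages (9 pupils as 11 per cent, 20 pupils as 28 per cent), so the total comes to 346 rather than the round figure of 350.

```python
# (age group, group size, pupils well above, pupils well below)
groups = [
    (7, 91, 7 + 11, 8),                # counts stated in the text
    (8, 102, round(0.19 * 102), 23),   # "approximately 19 per cent" advanced
    (9, 82, round(0.15 * 82), 9),      # size ASSUMED: 9 pupils = 11 per cent
    (10, 71, round(0.10 * 71), 20),    # size ASSUMED: 20 pupils = 28 per cent
]

total = sum(size for _, size, _, _ in groups)
outside = sum(above + below for _, _, above, below in groups)
share = outside / total

print(f"{outside} of {total} pupils lie outside the range; share = {share:.3f}")
```

Under these assumptions the share agrees closely with the one-third reported.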

Validity of progress norms.—A check is 
required on the validity of this approach of 
evaluating progress by reference to relative 
achievement for age. How consistent is an 
individual child’s rating from test to test? 
Since these standards were derived from suc- 
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cessive tests of the same children, the progress 
trends should have somewhat greater validity 
than could be derived from standardization 
data or from any survey involving only cross 
section testing. The question remains, how- 
ever: Are the children who comprise the 
highest or lowest 10 per cent on one test the 
same children who comprise that proportion 
on the next test or on a test given four or 
five years later? 

The consistency of a child’s relative place- 
ment within his age group at successive tests 
was measured somewhat roughly by the fol- 
lowing procedure. The whole group of chil- 


test. These six levels were designated “percentile
levels”—1, being the lowest 10 per
cent (below P10); 2, the level between P10
and P25; 3, between P25 and P50; 4, between
P50 and P75; 5, between P75 and


in Table III. After the third test, the n 
of children is much reduced and 
are not reliable. It is very 


in the highest 10 per cent on 
(percentile level 6) tended approximately 
maintain that level on subsequent tests; that 


mained within the same level through four 
tests; that, in contrast to these groups, the 
two lowest levels on the first test increased 


TABLE III 


MEDIAN PERCENTILE LEVEL ON SUCCESSIVE TESTS OF PUPILS IN EACH
PERCENTILE LEVEL ON FIRST TEST


Percentile level Subsequent tests 
on first test X 
3rd
6. 36 
5.61 
4.69 
3.78 
3.57
2.74 


To be read: Pupils who were at or above 90th 
percentile level of 6.82 on the second test, 6.36 on the 
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their standing markedly on the second test 
and tended to increase it still further on sub- 
sequent tests. 
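
The assignment of a score to one of the six percentile levels can be sketched as a lookup against the five cut points for the pupil's age group. The cut points below are hypothetical, and the treatment of a score falling exactly on a cut point (placed in the higher level, in keeping with "at or above 90th percentile" for level 6) is an assumption.

```python
from bisect import bisect_right

def percentile_level(score, cutpoints):
    """Return percentile level 1-6 for a score.

    cutpoints: ascending educational ages (P10, P25, P50, P75, P90)
    for the pupil's chronological age group.
    """
    if len(cutpoints) != 5 or list(cutpoints) != sorted(cutpoints):
        raise ValueError("expected five ascending cut points")
    # The number of cut points at or below the score, plus one, is the level.
    return bisect_right(cutpoints, score) + 1

# Hypothetical cut points in months of educational age for one age group.
cuts = (98, 102.5, 108, 115, 122)

levels = [percentile_level(s, cuts) for s in (90, 98, 110, 122, 130)]
print(levels)  # [1, 2, 4, 6, 6]
```

Applying the function to each pupil's score on each successive test gives the levels from which the medians of Table III are formed.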

Whether this would be generally the case 
or is a consequence of special conditions 
affecting the school during this period can- 
not be said with certainty, but there is some 
evidence that the general level of achievement 
of both the best and the poorest pupils rose 
from year to year during this period, without 
greatly affecting the level of the median 
achievement. In Table IV an analysis is shown 
for pupils of each age in successive years 
(rather than for the same pupils at successive
tests). The median percentile level at each
age fluctuates from year to year without much 
evidence of any trend, but there is some sug- 
gestion of a trend for the percentage of pupils 
in percentile level 6 to increase and the per- 
centage in percentile level 1 to decrease. If 
the distribution of achievement at an age 
level remained constant from year to year, we 
should expect about 10 per cent of pupils to
be found in percentile level 6 and in per- 
centile level 1. While the trend is not clear- 
cut, it suggests that better achievement was 
secured at both extremes, in later years, 
from which one may infer that the school 
program became better adapted to individual 
needs. 


These findings suggest that the progress 
norms are fairly reliable from the 25th per- 
centile up but that they are very unstable 
below that. It would be extremely gratifying 
to believe that good teaching and individual 
attention may lift children from the lowest 
ten per cent up to the median. This may be 
true but it would not be safe to generalize so 
far from these data. It is possible that the 
transition period and the change in teaching 
methods made this particular period one of 
unstable accomplishment for the poorest students
and that individual attention and good
teaching were only compensating for previous 
neglect. The trend of the P10 and P25 lines
in Chart 2 may, therefore, need revision to 
fit other situations or to fit this situation at 
another time. 

Need for study of individuals whose trends
show marked fluctuation.—The progress of
many individuals was plotted graphically in 
order to show individual trends. In general, 
very low initial scores tended to rise mark- 
edly, and very high initial scores tended to 
remain high. There were a few individuals, 
however, who showed marked fluctuations.
Examples of unusual progress records are 
shown in Table V. 

Pupil 1, for example, was tested first at 
the age of 7-7 in Grade 2. Educational age
was then 8-7 (about equal to the 75th per- 
centile at this age). Gains in months of edu- 
cational age for successive years were 18, 22, 
33 and 12—a total of 85 months in 48, with 
the gain at the fourth test more than two and 
one-half times as much as the gain at the 
third in terms of educational age. In Grade 
6, at the age of 11-7, this child’s status was 
15-8, far above the 90th percentile for his
age. His I.Q. also rose during this time from
114 to 158, but the use of different tests and 
perhaps different examiners makes I.Q. com- 
parisons of doubtful validity. 
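
The age arithmetic in Pupil 1's record can be checked with a short sketch. It assumes that ages and educational ages are written as years-months (so 8-7 means 8 years 7 months); the function names are illustrative only and do not come from the study.

```python
def age_to_months(age: str) -> int:
    """Convert an age written as 'years-months' (e.g. '8-7') to total months."""
    years, months = age.split("-")
    return int(years) * 12 + int(months)


def months_to_age(total_months: int) -> str:
    """Convert total months back to the 'years-months' notation."""
    years, months = divmod(total_months, 12)
    return f"{years}-{months}"


# Pupil 1: educational age 8-7 at the first test, 15-8 at the last test.
first_ea = age_to_months("8-7")
last_ea = age_to_months("15-8")
yearly_gains = [18, 22, 33, 12]  # gains in months quoted in the text

# Chronological time elapsed between age 7-7 and age 11-7.
elapsed = age_to_months("11-7") - age_to_months("7-7")

print(last_ea - first_ea, sum(yearly_gains), elapsed)
```

The difference in educational age and the sum of the yearly gains both come to 85 months, over 48 months of chronological age, which agrees with the figures quoted for this pupil.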

The questions which are thrown into relief 
by this type of analysis are: Is a shift of this 
sort explicable by factors that can be iso- 
lated? Does it represent more than unreli- 
ability of test results? Is it the result of the 
kind of educational program in which the 
child has been placed; or does it reflect a 
particularly favorable home environment; or 
is it an unpredictable result of internal 
growth factors as yet unknown or unanalyz- 
able? The school should be able to provide 
data that might throw some light on these 
questions. 

A different picture is given by Pupil 6—a 
child overage in Grade 3, with an educational 
age of 7-9 at the age of 10-6. From September
(age 10-6) to June (age 11-3) in Grade
3, this child actually lost 11 months in tested 
achievement, but made a double gain the next 
year. In spite of gains of almost 12 months 
a year for the next two years, his educational 
age at age 14-3 was only 10-6 and he re- 
peated the 6th grade, completing it at age 
15-3 with an educational age of 14-4—hav- 
ing progressed 46 months in the year! At 10, 
this child was below the 10th percentile for
his age; at 14 he had almost reached the 
25th percentile; at 15 he had reached the 
50th percentile for his age in spite of the fact 
that his grade placement was then three 
years below the “right grade for age.” 

The school ought to be able to say 
whether this child’s loss in educational age 
in the 3rd grade indicated actually lower 
efficiency; whether it was correlated with a 
lack of good teacher-pupil relationships, with 
illness or poor attendance, with home conditions
which inhibited his emotional development
or failed to provide physical essentials
for growth, or with other factors that might 
contribute to an interpretation; and whether 
his later remarkable gains were related to 
other ascertainable factors influencing growth. 

In spite of the phenomenal gains of some 
individuals, however, others showed little gain 
and, therefore, decreased their percentile 
level, or relative placement in their age group. 
Pupil 7, for example, had a status just below 
the 25th percentile on the first test but was 
below the 10th percentile the following year
and did not regain his initial position through
the 6th grade. Pupil 8 fluctuated around the
10th percentile level for five years, gaining
as much as 11 months in educational age in
Grade 3, but actually losing 2 months in
educational age upon repeating Grade 5.
Pupil 11 was above P50 on the first test but
lost considerably the next year and did not
regain his percentile placement the third year
although he made 15 months' gain in 12.
Pupil 10 was above the 90th percentile in
Grade 4 (with an I.Q., interestingly enough,
of 97) but apparently merely marked time
the next year, so that his percentile placement
was between 75 and 90, where it remained
the following year.


TABLE VI

TENTATIVE AGE PROGRESS PERCENTILE NORMS FOR EDUCATIONAL AGE ON
STANFORD ACHIEVEMENT TESTS

Studying individual fluctuations in this
way should make the school better able to 
relate factors that affect individual growth 
(such as nutrition, home routine, teacher 
effectiveness, pupil relationships making for 
security or insecurity, satisfaction or frustra- 
tion, etc.) to individual pupils and so achieve 
better guidance for individuals. 

The progress norms here given are of lim- 
ited application because of the special condi- 
tions under which they were derived. If the 
method were applied to a large and unselected 
sampling of age groups over a period of years, 
it would provide a stable standard of growth 
to be expected in tool subjects at various 
levels of the range of achievement. As grad- 
ing becomes more flexible and materials of 
instruction better adapted to the wide range 
of individual differences, such a method 
would make it possible to have a better guide 
for evaluation of pupil growth in the tool 
subjects than we now have. 


Part II 


APPLICATION OF AGE PROGRESS PERCENTILE
NORMS TO SOME PROBLEMS OF EVALUATION
OF THE SCHOOL PROGRAM


The primary purpose of the age progress 
percentile norms presented in Part I is to 
afford a better criterion of individual progress 
in elementary school subject skills than is 
provided by the usual type of norm. At the 
same time, the measurement of individual 
pupil progress in this way permits certain 
kinds of evaluation not easily made by the 
usual use of norms. In this section, the per- 
centile norms will be used to illustrate their 
applicability to the evaluation of school grad- 
ing, school marks, and teacher effectiveness, 
in the community supplying the data from 
which the technique was developed. 

Percentile level for age and grade place- 
ment.—Since the percentile level is an 
achievement level relative to age, regardless 
of grade placement, the first question of in- 
terest is: At what grade levels were the vari- 
ous percentile levels at each age found? Since 
the extremes of the distribution contain few 
cases, the two lowest percentile levels (Pc-level
1 and Pc-level 2) were combined,
and also the two highest levels (Pc-level 5
and Pc-level 6), thus dividing the achievement
distribution of an age group into quarters.
Table VII shows the surprising extent to
which each percentile level of each age is
distributed through different grades, in spite of
the rather high homogeneity of each level in
both chronological age and achievement. It
will be seen that with two exceptions (lowest
quarter of nine year olds, involving only 4
cases, and lowest quarter of ten year olds,
involving 11 cases), each quarter of the
achievement distribution at each chronological
age was found in at least three different
grades, and in some cases at five different 
grade levels. What meaning can be given to 
Grade 4, for example, when it contains some 
nine year olds below the median achievement 
of nine year olds and some twelve year olds 
above the 75th percentile for twelve year 
olds? Or to Grade 5, which has chronological 
ages from 9 through 13 and achievement
levels ranging from the 25th percentile level 
of ten year olds to above the 75th percentile 
level of thirteen year olds? Why are twelve 
year old pupils whose achievement is above 
the 75th percentile for twelve year olds scat- 
tered from Grade 4 through Grade 8? How 
can a course of study for Grade 4 be adapted 
to every level of achievement at every age 
from 9 through 12? These questions become 
very bewildering when it is recalled that the 
lowest quarter at age 9 did not exceed what 
is called a grade level of 2.9 and the highest 
quarter at age 13 was above a so-called grade 
level of 8.1 (see Chart 2 in Part I). 
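
The grouping of pupils into percentile levels and quarters can be sketched as follows. The cut points at the 10th, 25th, 50th, 75th and 90th percentiles are an assumption inferred from the surrounding discussion (about 10 per cent expected in levels 1 and 6, level 4 beginning at the median); the function names are hypothetical.

```python
from bisect import bisect_right

# Assumed cut points between the six percentile levels (P10, P25, P50, P75, P90).
CUT_PERCENTILES = (10, 25, 50, 75, 90)


def percentile_level(percentile_rank: float) -> int:
    """Return the percentile level (1 to 6) for a rank within the age group."""
    return bisect_right(CUT_PERCENTILES, percentile_rank) + 1


def achievement_quarter(level: int) -> str:
    """Combine levels 1-2 and 5-6, as in the text, to give four quarters."""
    if level <= 2:
        return "lowest quarter"
    if level == 3:
        return "second quarter"
    if level == 4:
        return "third quarter"
    return "highest quarter"


for rank in (5, 30, 50, 80, 95):
    level = percentile_level(rank)
    print(rank, level, achievement_quarter(level))
```

Under these assumed cut points a pupil exactly at the median falls in level 4, which matches the later statement that level 4 or higher means at or above the median.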


In general, the grade placement of the 
lowest level had a narrower range than that 
of the highest level. From Table VIII, which 
indicates the grade having the highest fre- 
quency at each achievement level, it will be 
seen that Grade 6 is the modal placement for 
the highest achievement level at age 11, but 
35 per cent of the pupils having this level of 
achievement (as shown in Table VII) were 
below Grade 6. From the modal placement, 
it can be seen that while the difference in 
modal grade placement between the highest 
and lowest quarters of achievement at age 9 
is one grade, it extends to three grades by age 
13. There was evidently only a very loose 
relationship between grade placement and 
achievement level for age. It is clear that a 
grade here did not represent clearly either a 
well-defined level of achievement or a narrow 
chronological age range. 

TABLE VII

PERCENTAGE DISTRIBUTION OF PERCENTILE ACHIEVEMENT LEVELS THROUGHOUT THE GRADES, AT
EACH CHRONOLOGICAL AGE FROM 9 THROUGH 13


TABLE VIII

MODAL GRADE PLACEMENT FOR EACH PERCENTILE ACHIEVEMENT LEVEL OF EACH AGE

Percentile level for age

Percentile level and ability groups.—However,
ability grouping was practiced in this
school and the grade placement might be
much less important than the grouping within
the grade. Table IX shows the percentage of
each ability group of Grades 3-8 which is in
the various percentile levels. As in Tables VII 
and VIII, the two highest and the two lowest 
percentile levels are combined, thus dividing 
the pupil distribution into four quarters of 
achievement. In almost every section of every 
grade all achievement levels are represented. 
The only exceptions are that in the A sections 
of Grades 5 and 8 there are no pupils in the 
lowest quarter of achievement and in the C 
sections of Grades 6 and 8 there are no pupils 
in the highest quarter of achievement. There
is, however, a fairly strong tendency, in general,
for the lowest quarter of achievement for
age to be in C sections and for the highest 
quarter to be in A sections. The second quar- 
ter is nearly equally divided between B and 
C sections and the third quarter nearly 
equally between B and A. While the ability 
grouping is not extremely clear-cut, it seems 
to make a better distinction regarding 
achievement for age than does grade place- 
ment. 

TABLE IX

PERCENTAGE DISTRIBUTION OF PERCENTILE
ACHIEVEMENT ACCORDING TO
ABILITY SECTION OF GRADES

Grade    Percentile level    Ability section

Percentile level and I.Q.—A word may be
said about the relation of achievement level
for age and ability level as measured by
intelligence tests. When the average percentile
level for age over a period of years is compared
with the average I.Q. (most pupils had
two to four intelligence tests during the
period), the relationship is on the whole very
close, as one might expect. The correlation
is r = .821 ± .010. The extent of variation
at each I.Q. level is shown in Table X. One 
of the most interesting things about this table 
is that about 12 per cent of pupils in the I.Q.
range 70-89 attained an achievement level 
above the median for their age group, and 
half of them were above the lowest quarter 
of achievement for age. Achievement at this 
level of intelligence would appear to be rather 
unpredictable. This is a finding which checks 
with clinical observation. The “borderline” 
group, as measured by intelligence tests, that 
is, the area below average but not clearly 
feebleminded, is a group that cannot be 
pigeon-holed. Development may depend on 
other factors than “intelligence,” which, at 
this level, are more potent. At I.Q. levels 
below 70, however, achievement was generally 
in the lowest quarter while at I.Q. levels 
above 110 it was most frequently in the 
highest quarter and in almost all cases in the 
upper half. From the relation between abil- 
ity and achievement as indicated by these 
tests, it may be inferred that school attain- 
ment in this community was at least as high 
as and probably a little higher than is found 
on the average. 
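
The figure following the correlation coefficient is consistent with a probable error, a convention common in studies of this period, although the text does not say so. The sketch below assumes the usual probable-error formula for a product-moment correlation and takes N to be the 447 pupils mentioned in the discussion of repeaters; both are assumptions, not statements from the study.

```python
import math


def probable_error_of_r(r: float, n: int) -> float:
    """Probable error of a correlation: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


# Assumed inputs: r = .821 as reported, n = 447 pupils (assumption).
pe = probable_error_of_r(0.821, 447)
print(f"{pe:.3f}")
```

With these inputs the result rounds to .010, in agreement with the reported value, which lends some support to reading the figure as a probable error.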

Percentile level of repeaters.—During the
period under consideration 57 pupils (out of 
447) at some time repeated a grade. Approx- 
imately equal numbers of repetitions occurred 
in each grade from 3 to 6; none was below 
the 3rd grade and only 4 were above the 6th 
(3 of Grade 7 and 1 of Grade 8). The per- 
centile level distribution of these pupils at the 
time they failed of promotion and in the fol- 
lowing year is shown in Table XI. Eleven of 
them (19.3 per cent) were above the median 
achievement level for age at the time of repeating.
Twenty-two (38.6 per cent) were in
the lowest quarter. The majority (60 per
cent) were in the middle half of achievement
for their age. The reason for repetition could
not, therefore, have been low achievement
alone. It may have been thought that certain
pupils could improve their achievement, however,
by repeating. What effect did repetition
have on their achievement level? About a
quarter of them (14 pupils) attained a higher
percentile level; over 60 per cent (35 pupils)
remained within the same level; and 14 per
cent (8 pupils) were lower after repeating.


TABLE X

PERCENTAGE DISTRIBUTION OF PERCENTILE ACHIEVEMENT LEVELS, ACCORDING
TO I.Q. CLASSIFICATION

Average¹ percentile level for age

Average¹ I.Q. level
90-109    11.8   40.4   39.4   8.5   188
110+

¹ For each pupil an average percentile placement and an average I.Q. were calculated, based on all
available measures during the years in which he was tested.


TABLE XI

PERCENTILE LEVEL OF REPEATERS IN YEAR PRECEDING AND FOLLOWING REPETITION

Percentile level    Number of cases    Percentile level second year    Number of cases
                                       1  2  3  4  5  6                Lower  Same  Higher

Considerable fluctuation may occur within 
the same percentile level and a more refined 
measure of gain was thought to be needed. 
A gain ratio was used which indicates the rate 
of gain relative to the time elapsed:

\[
\text{Gain ratio} = \frac{\text{Gain in months of educational age}}{\text{Number of months between tests}} \times 100
\]


TABLE XII

MEDIAN RATE OF GAIN BEFORE AND AFTER
REPETITION AT EACH PERCENTILE LEVEL

Percentile level    Median rate of gain¹

Note: Based on 46 of the 57 cases for whom [...]
¹ [...] months' gain in educational age in [...]

A rate of 100 indicates “normal” expectancy;
150, a gain at 1.5 times the normal rate;
50, a gain of only half the normal rate. The
gains after repeating were definitely higher
than the gains before repeating for pupils at
percentile level 3 or above, but lower after
repeating for pupils below percentile level 3
(Table XII). The number of cases involved
is, of course, too small to be more than sug- 
gestive, but the distribution of gains before 
and after repeating (Table XIII) at least con- 
tains the suggestion that when fairly compe- 
tent pupils who were gaining slowly repeated 
a grade, they were somewhat stimulated, but 
when pupils at the lower levels of ability were 
required to repeat, the effect was just as likely 
to be to retard progress. 
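
The gain ratio described above can be sketched in a few lines. The function name and argument names are illustrative; the only thing taken from the text is the definition itself (months of educational age gained, divided by months elapsed between tests, times 100).

```python
def gain_ratio(gain_months: float, months_between_tests: float) -> float:
    """Rate of gain relative to elapsed time; 100 is the 'normal' expectancy."""
    if months_between_tests <= 0:
        raise ValueError("months_between_tests must be positive")
    return gain_months / months_between_tests * 100


# The three reference points named in the text.
print(gain_ratio(12, 12))  # normal expectancy
print(gain_ratio(18, 12))  # 1.5 times the normal rate
print(gain_ratio(6, 12))   # half the normal rate

# Pupil 6 of Table V: 46 months of educational age gained in one year.
print(round(gain_ratio(46, 12)))
```

On this scale Pupil 6's final year, with 46 months gained in 12, corresponds to a ratio of roughly 383, which shows how far individual records can depart from the group medians.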


Percentile level and rate of gain in Grades 
3-8.—The rate of gain made by a pupil in 
any grade may be indicative, among other 
things, of whether the curriculum of that 
grade is appropriate to his level of achieve- 
ment at that time. The average percentile 
level of a pupil over a period of years is prob- 
ably as valid an indication of general ability 
as I.Q., as has been shown, and should on
logical grounds certainly be more predictive 
of school progress. There are several impli- 
cations that may be drawn from the median 
gains in different grades of pupils at different 
achievement levels for age (Table XIV). In
the third grade, pupils in the highest quarter 
of achievement for age gained at more than 
the “normal” expectation but at about the 
same rate as pupils whose achievement was 
in the lowest quarter for their age. Pupils in 
the highest quarter continued to increase 
their rate of gain through the 6th grade, to a 
point more than half again as fast as the 
“normal” rate, after which the rate decreased 
to about “normal” in Grade 8. In the lowest 
quarter the rate dropped from 120 in Grade 
4 to 66 in Grade 5, recovered a little in Grade 
6, and then dropped rapidly to a rate of 38 
in Grade 8. The middle half of the achieve- 
ment for age distribution had the most rapid 
gain of any level in Grade 3, after which it 
varied only slightly above or below 100, ex- 
cept in the 7th grade where the gain was 
only 71. These facts suggest that the subject 
matter in Grade 5 and Grade 7 may have 
been particularly difficult for pupils who were
below the top quarter of achievement for 
their age; that pupils in the lowest quarter 
of achievement for age found progress very 
difficult above Grade 6; and that those in the 
top quarter began to rest upon their laurels 
by the 7th grade and had considerably re- 
laxed their effort by Grade 8. This type of 
progress is perhaps almost inevitable under 
the usual school program, which, with its 
emphasis on grade organization, probably can 
not provide a very satisfactory setting to en- 
courage the optimum growth of individual 
pupils in elementary subject skills. 

Percentile level and school marks.—Since 
pupils of various ages and various levels of 
attainment were found in the same grade and 
in the same ability section of the grade, the 
question naturally arose as to how their 
school marks reflected their varying status. 
The correlation of percentile level with school
marks for all pupils combined is much lower
than with I.Q., although one might expect
that school marks should reflect achievement
more closely than I.Q. would. The correlation
of percentile level with school marks is r¹ =
.495 ± .012; with I.Q., .821 ± .010. In
Table XV the median percentile level for age 
of those who received each kind of school 
mark in each grade is shown. The evidence of 
increasing pressure for higher attainment 
through the grades is clear. In Grade 2, for 
example, the median achievement level for 
age of pupils who received school marks of 
B was about half way between the 25th and 
50th percentiles of the distribution of achieve- 
ment for age (percentile level 3.5), while in 
Grade 8 the median of pupils receiving marks 
of B was above the 75th percentile of achieve- 
ment for age (percentile 5.2). In Grade 8 
the median achievement level for age of 
pupils who received failing marks was as 
high as the median for age of pupils in Grade 
2 whose marks were B. 

If one translates a situation like this into 
terms of the effect upon an individual child 
in his progress through the grades, it is rather 
startling. What effect must it have upon a 
child’s ambition, interest in school, and gen- 
eral morale, if, with a level of capacity at 
about the 30th or 40th percentile for pupils 
of his age, and an achievement status which, 
objectively measured, keeps pace with that 
level, his success with school work is reported
to him as deteriorating from a mark of B in
Grade 2 to C or C+ in Grade 3, C- in
Grade 4, and D or F from Grade 5 on? To
be sure, there was little difference in the
actual tested achievement level for age, at
Grade 6 and above, between those who received
marks of D and F and those who received
marks of C- and C. It may be that
those who received marks of C in Grades 6,
7 and 8 were older than those who received
D and F and so had higher educational ages
and could deal better with the curriculum
level of those grades. But to the child whose
marks are failures in spite of a continued
normal level of effort (and a consistent level
of actual attainment), this would be small
comfort.

¹ r, calculated by the method of contingency.


TABLE XIV

MEDIAN RATE OF GAIN IN GRADES 3 TO 8 MADE BY PUPILS AT HIGH, MEDIUM, AND LOW
PERCENTILE OF ACHIEVEMENT FOR AGE

Average percentile level for age*              Grade
                                       4     5     6     7     8
Highest quarter (Pc. level 5-6)      134   140   161   131   104
Middle half (Pc. level 3-4)          110    94   107    71   104
Lowest quarter (Pc. level 1-2)       120    66    95    31    88

* Each pupil was classified only once, by the average percentile level attained on all tests. The rate
of gain was the rate made in the grade indicated.


TABLE XV

MEDIAN PERCENTILE ACHIEVEMENT LEVEL FOR AGE OF THOSE RECEIVING EACH KIND OF
SCHOOL MARK IN EACH GRADE

School marks


In defense of the situation in this commu- 
nity, however, it should be pointed out that 
few failing marks were recorded, only 42 out 
of 577 pupils receiving marks below C 
(Table XVI). Furthermore, of those whose 
marks were failing, about a quarter had a 
percentile level for age at or above the median 
(level 4 or higher). This suggests that fail- 
ing marks may have been used as a spur to 
greater effort. Table XVI shows that, on the
whole, school marks tended to be fairly re- 
lated to achievement level for age, although 
a few pupils at the lowest percentile levels 
received marks of A and B, and a few at the 
highest levels received failing marks. 

TABLE XVI

PERCENTAGE DISTRIBUTION OF PERCENTILE
ACHIEVEMENT LEVEL, ACCORDING TO
KIND OF SCHOOL MARKS RECEIVED

Percentile level    School marks: [...]    C, C+    B    B+, A

70.0   18.3   10.3   1.4   87


Further evidence of the confusion in marks, 
grade placement, and tested achievement 
status for age may be shown by an analysis 
of nine and ten year old children who were 
in Grades 3 to 6 and who received marks of 
A, B+ and B (Table XVII). Only pupils who 
entered school after the new program was 
adopted were included. Table XVII-A shows 
percentages of each percentile level in each 
grade receiving above average marks, and 
Table XVII-B indicates the numbers upon
which the percentages are based. The num- 
bers, of course, are too small to yield satis- 
factory reliability, but the percentages never- 
theless raise some interesting questions. For 
example, 80 per cent of the 10 ten year olds 
at percentile levels 5 and 6 who were in Grade 
4 received marks of A or B, but only 2.6 per 
cent of the 30 ten year olds at the same level 
of achievement who were in Grade 5. The
reason for this is probably as follows: The ten
year olds of this level in Grade 4 are below
the modal grade placement for this percentile
level of ten year olds and, therefore, excel in
competition with the lower levels of ten year
olds, but those in Grade 5 are at the modal
grade for their percentile level and are competing
with ten year olds of the same ability
and with eleven year olds of average ability
or more for eleven. Although it may be
understandable how such a condition may
arise, it does not reduce the confusion nor
does it make the meaning of school marks any
clearer to the pupil who receives them. It is
difficult to see how the matter of school marks
can be much clarified, however, until the
confusion of the grade organization is resolved.
Reporting a pupil’s success or failure
with school work would be more logical and
more meaningful to the pupil and his parents
if it were referred to the rate of progress that
he, individually, should be expected to make.
For each child, his relative placement within
his age group and the rate of progress usually
attained at that placement level would provide
a more tangible goal and a more realistic
incentive.


TABLE XVII-A

PER CENT OF PUPILS OF EACH [...]
RECEIVED MARKS OF A, B+ AND B

[...] percentile levels 5-6 and in Grade 8 received [...]
[...] percentile level who were in Grade 4 received [...]
level in Grade 5, etc.


TABLE XVII-B

NUMBER OF PUPILS OF EACH PERCENTILE LEVEL AT AGES 9 AND 10 IN GRADES 3 TO 6

[...] comprise only pupils who were 9, 10 or 11 years old when the record ended and, therefore,
[...] after the new program has been [...]


Percentile level and teacher effectiveness.—
The growth made by the pupils in a teacher’s 
class is one of the best measures of the 
teacher’s effectiveness—or would be if we 
could have valid and reliable measures of 
growth as it is influenced by the teacher. 
There are not yet any satisfactory objective 
measures for growth in social and emotional 
maturity and total integration of personality. 
These are important aspects and may pos- 
sibly be stimulated by some teachers whose 
pupils show less academic gain than others. 
Even in the tool subject skills, however, the 
use of pupil gains as measures of teacher 
effectiveness is complicated by so many 
factors that it has not been found entirely 
satisfactory. 

In this study, we attempted to keep con- 
stant the factors of variation in age grade 
status and in ability by using the percentile 
level for age as the basic reference point. It 
is believed that the way in which a pupil is 
found to gain, with reference to the level of 
his achievement relative to his age rather 
than to his grade placement, indicates 
whether the teacher has had an effect on that 
pupil, regardless of whether he is in the right 
grade for his age or the right section for his
ability.

Combining teacher data from several years
and from different classes might obscure certain
trends, if there should be any, which
would be important for interpretation. Several
preliminary analyses were, therefore,
made. One question was whether the quality
of teaching varied from year to year. In each
year of the experiment each teacher was rated
by the Superintendent on a subjective scale
from 1 to 10. Sixty ratings of classroom
teachers in Grades 2 through 6 were made in
the five years of the experiment. In Grades 3
to 6, on which most of the following analyses
are based, there were 54 ratings. The distribution
of ratings indicated that ratings of 8,
9, and 10 could be called superior or good;
ratings of 5, 6, and 7, satisfactory or average;
and ratings of 1, 2, 3, and 4, inferior or poor.
Of all the ratings, 35 per cent were superior,
36.7 per cent average, and 28.3 per cent inferior.
Proportions fluctuated in the various
years but no consistent trend was apparent.
The average ratings of teachers in Grades 3
to 6 in each of the five years of the experiment
were, respectively, 6.3, 6.5, 6.4, 5.6, 7.0.
No particular trend was, therefore, observable
which would unduly influence results from
year to year.

A second factor that might obscure certain
effects in combining data is that of whether
good and poor teachers were equally distributed
in grades and ability sections. It is evident
that the poorest teachers were more
likely to be assigned to the average ability
groups and the best teachers to the high ability
groups, though there were some exceptions:
62 per cent of the best teachers were
with A sections and 67 per cent of the poorest
teachers were with B sections. The average
ratings for four years (excluding the first,
when the grades had only two ability groups)
in different ability sections were as follows:

Grade  [average ratings by ability section not recoverable]

It is evident, therefore, that B sections were
at some disadvantage, compared with A and
C sections, in regard to teacher personnel.

A third factor that needs to be mentioned
is the continuity of teaching. There were 25
different teachers in Grades 3 to 6 during the
five years. Of these, 8 taught only one year,
9 two years, 5 three years, 2 four years, and
1 five years. In general, when a teacher
taught more than one year she continued
with the same grade and ability section and
in general, ratings improved the longer teachers
remained, as is indicated in Table XVIII.


TABLE XVIII

AVERAGE RATINGS OF TEACHERS WITH VARYING LENGTH OF SERVICE

Number of    Number of years    Average rating in each year
teachers     in position        (1st to 5th; values not recoverable)
    8             1
    9             2
    5             3
    2             4
    1             5


A fourth factor is, of course, the subjectiv- 
ity of the ratings and the possibility of a shift 
in subjective standards from year to year, 
from grade to grade, or from teacher to 
teacher. We have no way to evaluate this 
factor except the gains made by pupils, and 
these admittedly give only a partial picture. 
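
The three rating bands used in the analyses that follow can be sketched as a simple lookup. The band boundaries come from the text (8 to 10 good, 5 to 7 average, 1 to 4 poor); the counts of 21, 22 and 17 ratings are not stated in the study and are inferred here only because they reproduce the quoted percentages of the 60 ratings.

```python
def rating_band(rating: int) -> str:
    """Map a Superintendent's rating (1 to 10) to the band named in the text."""
    if not 1 <= rating <= 10:
        raise ValueError("rating must be between 1 and 10")
    if rating >= 8:
        return "good"
    if rating >= 5:
        return "average"
    return "poor"


# Inferred counts (an assumption): the only split of 60 ratings that gives
# 35, 36.7 and 28.3 per cent when rounded to one decimal place.
inferred_counts = {"good": 21, "average": 22, "poor": 17}
total = sum(inferred_counts.values())
shares = {band: round(100 * n / total, 1) for band, n in inferred_counts.items()}
print(total, shares)
```

Since only the percentages are reported, the counts should be read as a consistency check rather than as data from the study.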

Rate of gain made by pupils under good, 
average, and poor teachers—Combining 
pupils of Grades 3 to 6 inclusive, but keeping 
percentile achievement levels for age distinct, 
it is apparent that at every level of achieve- 
ment pupils gained less rapidly under poor 
teachers than under average or good teachers. 
The median gains were less and the upper 
and lower quartiles of the gains were less 
(Table XIX). Differences between good and 
average teachers were not so clear. At the 
lower achievement levels (percentile levels 1 
and 2 and percentile level 3) good teachers 
secured larger gains than average teachers; 
but at the higher achievement levels (per- 
centile level 4 and above) the reverse was 
true. There does not appear to be any logical
explanation of this, other than that there
were too many complicating conditions to
make trends distinct. It seems even more
illogical when it is remembered that pupils at
higher achievement levels were more generally
in the A sections and the A sections were
more often taught by teachers with high
ratings. No doubt, one factor is the subjectivity
of the ratings: it is easier to discover
the poorest teachers than the best.² The
effect of teachers on pupils is, however, better
differentiated by using the classification of
percentile achievement level for age than by
any other combination of age and grade
status, and certainly is superior to throwing
together all pupils under one type of teacher,
since the various kinds of teachers did not
have equivalent distributions of pupil attainments.
(For example, 12.7 per cent of all
pupils taught by poor teachers were in percentile
levels 1 and 2, compared with 23.8 per
cent for average teachers; but only 16.6 per
cent of pupils under poor teachers were in
percentile levels 5 and 6, compared with 42.7
per cent of pupils under good teachers.)


TABLE XIX

Rate of gain

Percentile level    Teacher rating    Median    Upper quartile    Lower quartile
1 and 2             Good               108          180                64
                    Average            106          155                56
                    Poor               100          131                30
3                   Good               128          182                81
                    Average            113          168                70
                    Poor                75          131                33
4                   Good               121          172                76
                    Average            140          200+               83
                    Poor                97          137                62
5 and 6             Good               143          186               100
                    Average            154          200+              111
                    Poor               119          144                65
All pupils          Good               125          185                89
                    Average            132          180                86
                    Poor                94          187                47
Case studies of individual teachers —With 
so small a number of teachers who remained 
for more than one year, the relation between 
rating and class gains cannot be very reliably 
determined. Certain cues may be had froma 
consideration of those who were rated con- 
sistently high or low or changed from one 
classification to another. Table XX shows, for 
each of the ten teachers who were classifiable 
in this way, the successive ratings, the grades 
taught, the median achievement level of the 
classes, and the median rates of gain of the 
classes. Teacher L, for example, was rated 8 
for the first three years and 10 for the next
two. While teaching first a 4B section whose 
median percentile level was 3.1 (that is, 
barely above the 25th percentile level of 
achievement for age), she secured a median 
gain rate of 113 (13 per cent above the 
“normal” expectancy of one grade per year). 
After that she taught 4A sections with per- 
centile levels above 4 and secured gains 
above 120. In one year, a class with a median 
percentile level of 5.4 (above the 75th per- 
centile for age) made a median gain of 163 
(nearly two-thirds of a grade more than 
“normal” progress). This is a consistent picture
of good teaching, although the rate of
gain would naturally be above “normal”
when the class level of achievement was also
high. There are only three conspicuous instances
of inconsistency in Table XX. Teacher
W, rated 10 in the third year, taught a class
with an extremely high level of achievement
for age but secured a median gain rate of
only 85. Teacher X, who had for two years
been rated as a superior teacher, was rated
only 6 the third year, although in this year
she taught a 6B class below the median
achievement level for age (percentile level
3.4) and secured a median gain of almost
three and one-half times the normal rate.
One wonders if these extraordinary gains were
made at a sacrifice of other values which
might have been evident to the Superintendent
but do not show in the record. The
third inconsistency occurred with Teacher I,
whose class made a gain of over 200 per cent
in the year that she was rated 6 and in the
following year a gain of 113 when her rating
jumped to 9, the achievement level of both
classes being about the same. With these exceptions,
the ratings are reasonably consistent
with the gains made by pupils, taking into
consideration the potential gains to be expected
from the percentile level. As is apparent
from Table XIX, however, there was a
wide range of pupil gains at every level of
achievement and under all types of teachers,
indicating that many factors enter into the
problem of influencing pupil growth even
when this is narrowly defined as growth in
the tool subject skills.

² A number of studies have shown greater consistency among
different judges in rating teachers poor than in rating them
superior.
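The gain rates quoted in these case studies are percentages of the "normal" expectancy of one grade per year. A minimal Python sketch of that arithmetic follows; the function name and the grade-equivalent inputs are hypothetical and are not taken from the study's records.

```python
def gain_rate(start_grade, end_grade, years=1.0):
    """Rate of gain as a percentage of normal progress (one grade per year).

    start_grade and end_grade are tested grade equivalents; treating them
    as the inputs is an assumption made for this illustration.
    """
    if years <= 0:
        raise ValueError("years must be positive")
    return 100.0 * (end_grade - start_grade) / years


# A class that moves from grade equivalent 4.00 to 5.13 in one year has
# gained 1.13 grades, i.e. 13 per cent more than the normal expectancy.
example_rate = round(gain_rate(4.00, 5.13))
```

A rate of 100 therefore means exactly one grade of tested progress per year, and a rate of 85 means fifteen per cent less than that.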


Variation of gains under good, average, 
and poor teachers—The variation may be 
expressed in another way. Of the 45 class 
groups for whom gains could be measured, 
8 made gains of 150 per cent or more, 30 be- 
tween 75 and 150, and 7 less than 75. Of the 
8 groups making exceedingly high gains, 3 
were taught by superior teachers, 5 by aver- 
age teachers, but none by poor teachers. Of 
the 30 groups making average gains, 12 were 
taught by superior teachers, 11 by average 
teachers, 7 by poor teachers. Of the 7 groups 
making extremely small gains, 2 were taught 
by superior teachers, 3 by average teachers, 
2 by poor teachers. It might, therefore, be 
said that at least an average teacher was re- 
quired to produce very large group gains, but 
that small group gains might be made even 
under superior teachers under certain cir- 
cumstances. 
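The counts in this paragraph can be laid out as a small cross-tabulation and checked against the stated totals. The sketch below is in Python; the counts are those given in the text, while the dictionary layout and names are chosen only for illustration.

```python
# Class groups by size of gain (rows) and teacher rating (columns),
# using the counts stated in the paragraph above.
counts = {
    "150 or more": {"superior": 3, "average": 5, "poor": 0},
    "75 to 150": {"superior": 12, "average": 11, "poor": 7},
    "less than 75": {"superior": 2, "average": 3, "poor": 2},
}

# Row totals: number of class groups in each gain band.
row_totals = {band: sum(by_rating.values()) for band, by_rating in counts.items()}

# Column totals: number of class groups taught under each rating.
column_totals = {
    rating: sum(by_rating[rating] for by_rating in counts.values())
    for rating in ("superior", "average", "poor")
}

grand_total = sum(row_totals.values())
```

The row totals reproduce the 8, 30, and 7 groups named in the text, and the grand total reproduces the 45 class groups.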


Validity of gains as a measure of teacher
effectiveness—More convincing evidence of 
the relation of gains to teacher effectiveness 
is given in Tables XXI and XXII, where the 
difference between the rate of gain made un- 
der any teacher and the total rate of gain for 
the same pupils is shown. None of the classes 
under poor teachers had a median gain in ex- 
cess of the final rate for the same pupils, and 
the median was 18.7 points below the total 
rate for the same pupils. It is interesting, 
however, that about a third of the teachers 
who had superior ratings failed to maintain 
the rate of gain that the same pupils made 
for the whole period, and also that the largest
superiority was found under teachers regarded 
as average. It is possibly an open question 
whether the ratings or the relative gains (at 
least in the case of relatively large losses) are 
the more valid indication of teacher effective- 
ness on the growth of pupils. When all 
classes of a given teacher are combined and 
median gains of all pupils taught compared 
with final median gain for the same pupils, 
we still find one teacher rated good whose 
pupils failed to maintain their general rate 
of gain by 75 or more points, and one teacher 
rated average whose pupils exceeded their 
general rate of gain by more than 100 points 
(Table XXII). It is not easy to think of other 
values of teaching which would justify rating 
a teacher as superior if her pupils consistently
failed to progress in tool skills at the
rate they did in other classes. It is more
easily conceivable that an average teacher
might secure consistently large gains by strict
disciplinary drill. This kind of measure
would in either case, however, afford a check
on subjective ratings which should help to
make such ratings more searching.


TABLE XXI

EXCESS OR DEFICIENCY IN MEDIAN RATE OF
GAIN OF CLASSES UNDER GOOD, AVERAGE,
AND POOR TEACHERS COMPARED WITH
FINAL MEDIAN RATE OF GAIN
FOR THE SAME PUPILS

[Body of table not recoverable from the scan. Rows: excess points¹;
columns: teacher rating (Good, Average, Poor). Legible entries: 9 and
−18.7² in the Poor column; 19 and 17.5 in a column whose heading is lost.]

¹ Difference between class median gain and total median gain of same
pupils for whole period.
² These are medians of class medians.


TABLE XXII

EXCESS OR DEFICIENCY IN MEDIAN RATE OF
GAIN OF ALL PUPILS UNDER GOOD,
AVERAGE, AND POOR TEACHERS
COMPARED WITH FINAL RATE
OF SAME PUPILS

[Body of table not recoverable from the scan. Rows: excess points¹,
beginning at 100+; columns: average rating for teacher (Good, Average,
Poor). Legible entries: 8 and 10.0 in one column, 3 and −31.3 in another;
column assignment uncertain.]

¹ Difference between median gain of all pupils under a given teacher and
total median gain of same pupils for whole period.


SUMMARY 


In Part I a technique was developed for
measuring the progress of pupils as to their
relative status in educational achievement
with reference to their age. This technique
consisted in defining the progress made by the
same pupils, in educational age on the Stanford
Achievement Test, at the 10th, 25th,
50th, 75th, and 90th percentiles of educational
age distributions at successive chronological
ages, and then defining a pupil’s status
for his age as in one of six so-called percentile
levels of achievement for age, these six levels
being divided by the percentile points mentioned.
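The assignment of a pupil to one of the six percentile levels can be sketched in Python as a lookup against the five dividing points for his chronological age. The cut-point values below are hypothetical, and the treatment of a score that falls exactly on a dividing point (placed in the higher level) is an assumption; the study's own norms are not reproduced here.

```python
from bisect import bisect_right


def percentile_level(educational_age, cutpoints):
    """Return the percentile level (1 to 6) for one pupil.

    cutpoints holds the educational ages at the 10th, 25th, 50th, 75th,
    and 90th percentiles for the pupil's chronological age, in ascending
    order. A score equal to a cut point goes to the higher level
    (an assumption made for this sketch).
    """
    if len(cutpoints) != 5 or list(cutpoints) != sorted(cutpoints):
        raise ValueError("expected five ascending cut points")
    return bisect_right(cutpoints, educational_age) + 1


# Hypothetical cut points (educational age in years) for one age group.
cuts_for_age_9 = [8.0, 8.6, 9.3, 10.1, 11.0]
level_of_typical_pupil = percentile_level(9.0, cuts_for_age_9)
```

With these illustrative cut points a pupil whose educational age is 9.0 falls between the 25th and 50th percentiles, that is, in level 3.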

The technique was developed from data 
involving consecutive annual testing over a 
five year period in a community in which the 
average ability was slightly below the gen- 
eral average, while the average achievement, 
though below the “norm,” was above the ex- 
pected level for ability. The tentative age 
progress percentile norms developed are 
probably not widely applicable, but they are 
valid and useful within the community and 
could be revised to be applicable to any given
population group.
Within the limits of the data, it was shown
that while the progress of pupils at the 50th
percentile level of achievement for age was
approximately a straight line, progress at the
lower levels was at a decreasing rate. At
higher levels the tendency was not so clear
but there was some evidence of an accelerating
curve. The result is that while the difference
between the 10th and 90th percentiles
at age 7 is about 3 years in educational age,
at age 15 the difference is about 7 years.
A pupil maintaining progress at the 90th percentile
between age 7 and age 15 would make
9 years progress in 8, but a pupil maintaining
progress at the 10th percentile would progress
only 544 years in 8. 

It was indicated how this technique could 
be used to appraise the progress of pupils, 
especially those for whom grade norms are 
inapplicable (comprising about a third of the 
pupils in this study), and it was shown that 
such appraisal clarifies some of the problems 
about which schools need information for 
better pupil guidance. 

In Part II these percentile levels were used 
to evaluate grade placement, school marks, 
and teacher effectiveness in the same com- 
munity. 

It was shown that grade placement had 
very little relation either to the status of 
achievement for age or to chronological age, 
that children of the same age and the same 
level of achievement were found at as many 
as five different grade levels and that the 
higher levels of achievement at any age 
tended to be distributed over a wider range 
of grades than the lower. 

It was also shown that pupils of all levels 
of achievement for age were found in all 
ability sections of almost every grade, 
although not in the same concentration. 

Although about 1 pupil out of 8 repeated 
a grade during the five years covered by the 
data, more than three-fifths of those who re- 
peated were above the median achievement 
for their age at the time they repeated and 
only about one-fourth of them raised their 
percentile level appreciably by repeating. 
There was a differential effect of repetition 
upon pupils of different status, however, 
those above the lowest quarter making greater 
gains upon repetition, while those in the 
lowest quarter gained less rapidly when repeating.

The rate of gain at different levels of 
achievement for age in Grades 3 to 8 indi- 
cated the increasing pressure in the higher 
grades and the decreasing tempo of the lower 
achievement levels. Whereas the lowest 
quarter for age in Grades 3 and 4 gained in 
tested achievement at a rate 20 per cent 
above the “normal” rate (one grade per 
year), the lowest quarter for age in Grades 7 
and 8 made only about one-third of the 
normal progress expected. 

An extraordinary lack of relation was re- 
vealed between the achievement level for age 
and the quality of school marks. The same 
relative achievement status for age which 
rated a mark of B in Grade 2 was rated 
failure in Grade 8. To get a mark of B+ in 
Grade 2 required, on the average, that a 
child should be just above the median 
achievement level for his age, while to get 
B+ in Grade 6 or above required the child 
to be above the goth percentile of his age. 

It was shown that the gains made by pupils 
of different levels of achievement for age 
could be used as one way of evaluating 
teacher effectiveness. By this method, teach- 
ers who were rated on a subjective scale as 
below average were much more sharply dis- 
criminated from average teachers than were 
average teachers from superior teachers. An 
interesting difference between teachers rated 
superior and average was, however, found. 
The gains made by pupils below the median 
of achievement for their age were higher 
under superior than under average teachers, 
but at levels above the median pupils made 
larger gains under average teachers. Why this 
was so cannot be determined from the data 
at hand.

The study as a whole calls attention to the 
differential rates of progress made by pupils 
whose achievement varies with respect to
their chronological age, and indicates how 
failure to consider these different rates of 
progress in our usual framework of grading, 
promotion, ability grouping, and school 
marking creates a confusion that makes 
evaluation very difficult and greatly handi- 
caps the attainment of adequate pupil 
guidance. 





A WORK SHEET FOR THE JOHNSON-NEYMAN TECHNIQUE 


R. L. C. Butsch


Stevenson, Jordan & Harrison 
Chicago, Illinois 


I. ILLUSTRATING THE USE OF 
THE WORK SHEET 


THE PURPOSE OF THE JOHNSON— 
NEYMAN TECHNIQUE 


The Johnson—Neyman Technique is a very 
useful procedure for determining the signifi- 
cance of the difference between two groups 
of individuals on one variable, when two 
other variables are held constant by statis- 
tical methods. The technique was first intro- 
duced in 1936 by Palmer O. Johnson and J. 
Neyman,¹ and has been used as a very effective
statistical approach in a wide variety of
educational problems. In a recent article
Koenker and Hansen² have furnished a sample
analysis demonstrating the steps of procedure
when a calculating machine is available.
The present article is a re-analysis of
the technique to make possible the use of 
logarithms. It is felt, however, that even if 
a mechanical calculator is used, the present 
arrangement will greatly simplify the process. 

A good illustration of the application of 
this technique is a recent study by Treacy,³
in which he used two groups of pupils differ- 
entiated on the basis of achievement in Prob- 
lem Solving in arithmetic. The upper third 
of the group were designated “good achiev- 
ers”, and the lower third “poor achievers”. 
The pupils were then measured on fourteen 
different reading skills. The problem was to 
determine on which reading skills there ex- 
isted a significant difference between good 
and poor achievers in Problem Solving in 
arithmetic, when mental age and chronolog- 
ical age were held constant. Another similar 
problem is that mentioned by Koenker and 
Hansen⁴ “comparing 90 excellent achievers in
division and 90 poor achievers in division on
ability in subtraction [and eleven other
factors] when the effects of mental age and
chronological age have been statistically controlled.”


¹ Palmer O. Johnson and J. Neyman. “Tests of Certain
Linear Hypotheses and Their Application to Some Educational
Problems.” Statistical Research Memoirs, 1:57-93; June, 1936.

² Robert H. Koenker and Carl W. Hansen. “Steps for the
Application of the Johnson-Neyman Technique - A Sample
Analysis.” Journal of Experimental Education, 10:164-73.

³ John P. Treacy. “The Relationship of Reading Skills to
the Ability to Solve Arithmetic Problems.” Unpublished Ph. D.
Dissertation, University of Minnesota, 1942.

⁴ Koenker and Hansen, op. cit., p. 164.

One of the major advantages of this technique
is that it makes unnecessary the experimental
matching of the subjects on the two
factors which are to be controlled or held 
constant. This is a distinct asset in that it 
greatly widens the field of usable data. It is 
sufficiently difficult to obtain adequate sam- 
ples when matching is done on only one 
factor; actual matching on two factors would 
frequently so restrict the samples as to pre- 
clude the demonstration of a statistically 
significant difference. The procedure has the 
further advantage of furnishing a graphical 
representation which shows the region within 
which the difference is significant. This fea- 
ture makes it especially valuable in prediction 
and guidance. 


THE FUNDAMENTAL STATISTICS REQUIRED 


The only fundamental statistics required 
for such an analysis are the means, standard 
deviations, and inter-correlations for the 
three measures involved—the one the signifi- 
cance of which is to be tested, and the two 
which are to be statistically controlled—for 
each of the two groups. The standard prac- 
tice is to represent the two measures held 
constant by x and y. The factor to be tested 
is indicated by z for one group [usually the 
better group, if that differentiation is implied 
in the constitution of the groups] and u for 
the other. Data involving only x and y, for 
the two groups, are distinguished by using 
the subscript 1, or the single prime (’) for 
the first group, and the subscript 2, or the 
double prime (”) for the second group. The 
data required are, therefore, the following: 


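                       First Group                              Second Group

Means                  $\bar{x}_1$, $\bar{y}_1$, $\bar{z}$      $\bar{x}_2$, $\bar{y}_2$, $\bar{u}$
Standard Deviations    $\sigma'_x$, $\sigma'_y$, $\sigma_z$     $\sigma''_x$, $\sigma''_y$, $\sigma_u$
Correlations           $r'_{xy}$, $r_{xz}$, $r_{yz}$            $r''_{xy}$, $r_{xu}$, $r_{yu}$
Number of Cases        $N_1$                                    $N_2$


STEPS IN THE SOLUTION OF THE PROBLEM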


The steps required in the solution of the 
problem of determining the significance of 
the difference between the two groups [or of 
deciding whether to accept or to reject the 
hypothesis that the two groups could be 
chance samples of the same total population] 
when two additional factors are statistically 
controlled, or held constant, will be sum- 
marized briefly. At this point in the discus- 
sion the formulas will not be given—they are 
analyzed in the second section of this article. 
The only purpose here is to show the nature 
of the process. The steps are arranged in two 
parallel columns, the one showing the mathe- 
matical steps, and the other the graphical 
representation which results. 


A. MATHEMATICAL STEPS

2. Using the data of x and y only [the two factors to be held
constant] find the weighting factor, P + Q.

3. Using data in x and y, and also in the measure to be tested
(z, u), find the absolute minimum of the sum of the squares, $S_a^2$.

4. By combining P + Q and $S_a^2$, find certain statistics:
A′, B′, C′, D′, E′, H′.

5. Using the data of x and y, and also those in z, u, certain
other statistics are found, designated as a, b, c.

6. From a combination of the statistics a, b, c, and those of the
type A′, B′, C′, etc., there are found a third set, designated
$a_1$, $b_1$, $c_1$.

7. By certain combinations of the three sets of statistics just
found, $w_{obs}$ is computed and compared with $w_{.01}$ and $w_{.05}$.
If $w_{obs}$ is smaller than $w_{.01}$, the hypothesis is rejected, and
there is a significant difference at the 1% level. Similarly for the
comparison with $w_{.05}$.

8. Certain other combinations yield F, the best estimate of the
true difference between the groups on the third factor (z, u) when
the first two (x, y) are held constant. Also, $V_F$, the variance,
and $\sigma_F$, the standard error, of this true difference are found.

9. By additional combinations of the data, the Equation of the
Region of Significance is discovered, and the curve bounding this
region is drawn on the graph.


B. GRAPHICAL REPRESENTATION

1. With the x factor on the horizontal axis, and the y factor on
the vertical axis, locate each individual, in each of the two groups,
on the graph.

4. From A′, B′, etc., are found the Co-ordinates of the Center of
Accuracy, $x_0$ and $y_0$. This point is located on the graph.

5. These statistics, a, b, c, determine the equation of the Line of
Non-Significance, which is located on the graph.

6. The statistics $a_1$, $b_1$, $c_1$ determine the Diameter of the
Region of Significance, which is located on the graph. This must pass
through the Co-ordinates of the Center of Accuracy, $x_0$, $y_0$.


Since a common type of problem in which 
this technique is applied is one in which two 
groups are to be compared on a number of 
different measures, with the same two factors 
held constant, the computations in Step A.2
may be carried out to the point of obtaining
some partial results. These partial results
may then be used in each of the comparisons,
by simply combining them with $S_a^2$, which
will differ for each comparison. This is true 
because the P + Q factor depends only on 
the two measures to be statistically con- 
trolled; the data for these will be the same for 
the two groups, no matter what third factor 
is to be compared. 
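The reuse described here can be sketched in Python. The sketch assumes that the six coefficients of P + Q (written A″ through H″ in the Work Sheet) have already been found once from x and y, and that each comparison supplies its own $S_a^2$. The class name and every numeric value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuadraticCoefficients:
    """Coefficients of A x'^2 + 2B x'y' + C y'^2 + 2D x' + 2E y' + H."""
    A: float
    B: float
    C: float
    D: float
    E: float
    H: float

    def scaled(self, factor):
        """Multiply every coefficient by one factor (here S_a^2)."""
        return QuadraticCoefficients(
            *(factor * v for v in (self.A, self.B, self.C, self.D, self.E, self.H))
        )

    def m(self):
        """The quantity A*C - B^2 used in locating the Center of Accuracy."""
        return self.A * self.C - self.B ** 2


# General data: computed once from x and y only (illustrative values).
general = QuadraticCoefficients(A=2.0, B=0.5, C=3.0, D=-1.0, E=-2.0, H=4.0)

# Each particular comparison contributes only its own S_a^2 (illustrative).
s_a2_by_measure = {"first rating": 10.0, "second rating": 4.0}
particular = {name: general.scaled(s) for name, s in s_a2_by_measure.items()}
```

Because every coefficient is multiplied by the same $S_a^2$, the quantity A′C′ − B′² for a comparison is the general value multiplied by the square of $S_a^2$, so it need not be recomputed from the separate coefficients.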


A Work SHEET FOR THE JOHNSON— 
NEYMAN TECHNIQUE 


As a result of the series of analyses which
will be developed in the second section of this
article, it is possible to set up a Work Sheet
which will greatly simplify the computations
required for the application of the Johnson-Neyman
Technique. A sample problem, using
this Work Sheet, is shown in Figure 1. The
resulting graphical representation is given in
Figure 2. While the Work Sheet is provided
with spaces for the recording of logarithms,
it should be kept in mind that it will also 

simplify the process when a calculating 
machine is used. When logarithms are em- 
ployed, not all of the spaces need be filled. 
Thus, the results for y, ,, etc., are only 
intermediate steps; if the logarithms only are 
recorded, they may be used in later computa- 
tions without finding the anti-logs. This is 
also true of several other items. Conversely, 
D,, E,, H,, since they are the results of addi- 
tion, and are used only in addition, need not 
be changed to the logarithmic form. 
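The point about intermediate results can be illustrated with a short Python sketch: a chain of multiplications and divisions is carried entirely in common logarithms, and the anti-log is taken only at the end. The function names and the numbers are hypothetical; signs are assumed to be tracked separately, as in hand computation, so only positive magnitudes are accepted.

```python
import math


def log_of_product(*factors):
    """Common logarithm of a product, found by adding the logs."""
    if any(f <= 0 for f in factors):
        raise ValueError("logarithms require positive magnitudes")
    return sum(math.log10(f) for f in factors)


def antilog(log_value):
    """Convert a common logarithm back to the ordinary number."""
    return 10.0 ** log_value


# Intermediate step: only the logarithm of 12.5 * 0.8 is recorded.
log_intermediate = log_of_product(12.5, 0.8)

# Later step: divide by 4.0 by subtracting its log, with no anti-log
# taken for the intermediate result.
log_final = log_intermediate - math.log10(4.0)
final_value = antilog(log_final)
```

Results that arise from addition and feed only into further additions gain nothing from this form, which is why they may be left as ordinary numbers.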

The illustrative example shown on the 
Work Sheet is based on the following prob- 
lem. For about two hundred fifty freshmen 
entering the College of Liberal Arts at Mar- 
quette University in the fall of 1942 the fol- 
lowing data were gathered: Rank in High 
School, Percentile Score on the American 
Council Psychological Examination, and vari- 
ous personality ratings furnished by the high 
schools. At the end of the first semester the 
average grade earned by each freshman was 
computed. The good group was made up of 
the eighty freshmen earning the highest aver- 
age grades, and the poor group of the eighty 
freshmen earning the lowest average grades. 
Since high school record and psychological 
test score have been shown previously to be 
fair predictors of college success, they were 
taken as the factors to be held constant—the 
former designated as x, and the latter as y. 
The two groups were then compared to see if 
there was a significant difference between 
them on the personality ratings, with these 
two measures statistically controlled. The 
particular rating used in the present illustra- 
tion is that of the estimate of “Probable Suc- 
cess”, which is designated as z for the good 
group, and u for the poor group. It was first
determined by the use of the t-Test that the
two groups differed significantly in average 
grade earned, and by the use of Snedecor’s 
F-Test that they were homogeneous with 
respect to Variance on that measure. 
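The formation of the two groups can be sketched in Python as follows. The function name, the identifiers, and the synthetic grades are hypothetical, and ties in average grade are broken by sort order, which is an assumption; the article does not say how ties were handled.

```python
def extreme_groups(average_grades, group_size=80):
    """Split students into the highest and lowest groups by average grade.

    average_grades maps a student identifier to the first-semester
    average. Students between the two extremes are left out.
    """
    if len(average_grades) < 2 * group_size:
        raise ValueError("not enough students for two separate groups")
    ranked = sorted(average_grades.items(), key=lambda item: item[1], reverse=True)
    good = [student for student, _ in ranked[:group_size]]
    poor = [student for student, _ in ranked[-group_size:]]
    return good, poor


# Synthetic data for 250 freshmen with distinct averages.
grades = {f"s{i:03d}": i * 0.01 for i in range(250)}
good_group, poor_group = extreme_groups(grades)
```

With about two hundred fifty freshmen and eighty in each group, roughly ninety students in the middle of the distribution take no part in the comparison.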


THE STEPS OF PROCEDURE


The procedure as illustrated by the Work 
Sheet, and the graphical representation, is 
here summarized. The carrying out of each 
step is indicated on either the graph or the 
Work Sheet with the appropriate letter and 
number referring to the step in this outline.
The Work Sheet is made up of three pages,
the first using data in x and y only, and the
second and third including data for the particular
comparison, z, u.


A. The General Data 


1. On the graph, with Rank in High School 
on the x-axis, and Psychological Percentile on 
the y-axis, locate each individual, indicating 
the good students by dots, and the poor stu- 
dents by x’s. 

2. On the first page of the Work Sheet, in 
the appropriate spaces, set down the means, 
standard deviations, and correlations, involving
x and y only, and N for each group. Fill
in the spaces for the logarithms, including
logs of squares where needed, and log of
$1 - r_{xy}^2$.

3. For each group separately, compute and 
record y, ,, ®,, ®,, and A. 

4. Find $A_1$, $B_1$, $C_1$, $D_1$, $E_1$, and $H_1$ for the
good group, and $A_2$, $B_2$, $C_2$, $D_2$, $E_2$, and $H_2$
for the poor group.

5. Adding together like results, find A″,
B″, C″, D″, E″, H″.

6. Find M″.

7. Find $x_0$, $y_0$, the coordinates of the Center
of Accuracy, and locate this point on the
graph.

8. Find T″.


These eight steps complete the work on the 
general data. The results indicated may now 
be used in each of the comparisons to be 
made. 


B. The Particular Comparison 


1. Indicate on the second page of the Work 
Sheet, the means, standard deviations and 
inter-correlations for the two groups, sepa- 
rately, for the particular comparison to be 
made. Fill in the spaces for logarithms. 

2. Find, for the good group, $b_{zx \cdot y}$, $b_{zy \cdot x}$,
and K, and, for the poor group, $b_{ux \cdot y}$, $b_{uy \cdot x}$,
and K.

3. Combine these to find a, b, c. Substitute
these values in the equation of the Line of
Non-Significance.

4. Substitute appropriate values of x′, find
corresponding values of y′, and locate the
line on the graph.

5. Find the separate halves of the formula
for $S_a^2$, from the data for the two groups;
add together to obtain $S_a^2$.

6. Multiply A″, B″, etc., by $S_a^2$ to find
A′, B′, C′, D′, E′, H′, and T′ for the particular
comparison. Multiply M″ by $(S_a^2)^2$
to obtain M.

7. Find $a_1$, $b_1$, $c_1$. Substitute in the Equation
of the Diameter of the Region of Significance.

8. Substitute appropriate values for x′, to
find corresponding values of y′, and draw the
Diameter of the Region of Significance
on the graph. Note that this line must pass
through the Center of Accuracy, $x_0$, $y_0$. It is,
therefore, best to substitute $x_0$ in the formula
at once, to see if $y_0$ results. This is a valuable
check on the accuracy of the work to this
point.

C. The Test of Significance 


1. Set down the value of n − s.

2. Find F.

3. Find $V_F$ and $\sigma_F$.

4. Compute and record the values of: £,
a, H, Rk’, k, €o, €,*, A, and p.

5. Find $w_{obs}$, $w_{.01}$, and $w_{.05}$. Compare to
test significance.

D. The Equation of the Region 
of Significance 


1. Find C, », and p’. 


2. Find w, and determine the nature of the 
curve of the Region of Significance from its 
sign. 

3. Find « > o: hyperbola 


ete 


= 


ear or 


4. Find 








pens Aes) 


6. Find the zero points for »: 
é=+ ve—— 
7. Substitute appropriate values for $\xi$, find
corresponding values for $\eta$, and draw the
curve on the graph.


INTERPRETATION 


In the illustrative example the Region of
Significance is the hyperbola. The zero points
for $\eta$ are found to be:

$\xi = 17.6$        $\xi = -28.6$

Both branches of the hyperbola fall partly
within the area occupied by the data. The
interpretation is as follows:


1. For freshmen in approximately the 
upper half of the distribution in both high 
school record and psychological test score, 
those who were more successful in college had 
higher ratings on “Probable Success” than 
those who made poorer records. 


2. Among freshmen in approximately the 
lower forty percent on both high school rec- 
ords and psychological test score, those who 
were more successful in college had lower 
ratings on “Probable Success” than those who 
made poorer records. 


In other words, the ratings on “Probable 
Success” differentiated between more and less 
successful freshmen who were in the upper 
half of their high school class and ranked in 
the upper fifty percentile on the American 
Council Psychological Examination. All of 
these would probably be admitted to college 
anyway. The ratings differentiated negatively 
for those who stood in the lower forty percent
of their high school class, and in the lower
forty percent on the American Council Psy- 
chological Examination. Few of this group 
would be admitted anyway, because of their 
low records. For all of the intermediate group 
—especially those who rated high on one of 
the measures and low on the other—where 
such a rating might be of value in guiding 
the Committee on Admissions—this estimate 
of “Probable Success” did not discriminate 
significantly between the probable good and 
poor students. 


II. THE DERIVATION OF THE
FORMULAS AS USED IN THE 
WORK SHEET 


THE ORIGINAL FORMULAS 


The first step, as indicated in the outline 
given above, is to find the weighting factor, 
P + Q, which depends only on the original 
data for x and y, the two factors which are 
to be statistically controlled. The formula for 
P + Q is given as follows: (J-N, p. 76, F.
82)⁵


P+Q=qtrt+s— — ro ( SS —+,)° 
The second step is to obtain $S_a^2$, the Absolute
Minimum of the Sum of the Squares.
The formula for $S_a^2$ is as follows: (J-N, p.
75, F. 62)

$$S_a^2 = N_1 \sigma_z^2 \, \frac{1 - r'^2_{xy} - r_{xz}^2 - r_{yz}^2 + 2 r'_{xy} r_{xz} r_{yz}}{1 - r'^2_{xy}} + N_2 \sigma_u^2 \, \frac{1 - r''^2_{xy} - r_{xu}^2 - r_{yu}^2 + 2 r''_{xy} r_{xu} r_{yu}}{1 - r''^2_{xy}}$$

It will be observed that $S_a^2$ comes out as a
single number, since every term in the formula
is included in the original data employed
in solving the problem. P + Q, on the other
hand, will result in a series of terms involving
$x'^2$, $x'$, $y'^2$, $y'$, $x'y'$, and some which include
neither x′ nor y′. When P + Q is multiplied
by $S_a^2$ there will result a different series of
terms in x′, y′, etc. It is from the coefficients
of these terms that the statistics A′, B′, C′,
etc., result. The formula is: (J-N, p. 85, F.
105)

$$S_a^2 (P + Q) = A'x'^2 + 2B'x'y' + C'y'^2 + 2D'x' + 2E'y' + H'$$

The expressions a, b, c, are obtained from:
(J-N, p. 85, F. 103)

$$F(x', y') = a + bx' + cy'$$

which is derived from: (J-N, p. 84, F. 102)

$$F(x', y') = (a_1^0 + a_2^0 x' + a_3^0 y') - (b_1^0 + b_2^0 x' + b_3^0 y')$$

This has been defined by Koenker and Hansen
in more familiar terms: (K-H, pp. 165-67)

$$F(x', y') = \left[ \bar{z} + \frac{r_{xz} - r'_{xy} r_{yz}}{1 - r'^2_{xy}} \, \frac{\sigma_z}{\sigma'_x} (x' - \bar{x}_1) + \frac{r_{yz} - r'_{xy} r_{xz}}{1 - r'^2_{xy}} \, \frac{\sigma_z}{\sigma'_y} (y' - \bar{y}_1) \right] - \left[ \bar{u} + \frac{r_{xu} - r''_{xy} r_{yu}}{1 - r''^2_{xy}} \, \frac{\sigma_u}{\sigma''_x} (x' - \bar{x}_2) + \frac{r_{yu} - r''_{xy} r_{xu}}{1 - r''^2_{xy}} \, \frac{\sigma_u}{\sigma''_y} (y' - \bar{y}_2) \right]$$

which will be recognized as merely the
difference between two regression equations:
the one predicting z from the best weighting
of x and y, and the one predicting u
from the best weighting of the same two
measures.

By setting the simpler form of this equation
equal to zero,

$$a + bx' + cy' = 0$$

and locating appropriate values of x′ and y′,
the Line of Non-Significance may be drawn
on the graph.

⁵ Throughout this section reference will be made in parentheses
to the formulas as given in the article by Johnson and
Neyman, and that by Koenker and Hansen, previously cited.
Thus, J-N will refer to the former article, and K-H the latter.
For both the page reference will be given, and for the former
the numbered formula will also be indicated.
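A minimal Python sketch of these quantities follows, assuming the standard two-predictor regression formulas and standard deviations computed with divisor N, so that N times the variance is a sum of squares. The names `GroupStats`, `regression_plane`, and the rest are hypothetical, as are all numeric values; the field `t` stands for the tested measure, z in one group and u in the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupStats:
    """Means, standard deviations, and correlations for one group."""
    n: int
    mean_x: float
    mean_y: float
    mean_t: float
    sd_x: float
    sd_y: float
    sd_t: float
    r_xy: float
    r_xt: float
    r_yt: float


def regression_plane(g):
    """Intercept and slopes for predicting t from x and y in one group."""
    denom = 1.0 - g.r_xy ** 2
    slope_x = (g.sd_t / g.sd_x) * (g.r_xt - g.r_xy * g.r_yt) / denom
    slope_y = (g.sd_t / g.sd_y) * (g.r_yt - g.r_xy * g.r_xt) / denom
    intercept = g.mean_t - slope_x * g.mean_x - slope_y * g.mean_y
    return intercept, slope_x, slope_y


def difference_coefficients(first, second):
    """a, b, c of F(x', y') = a + b x' + c y', first plane minus second."""
    i1, bx1, by1 = regression_plane(first)
    i2, bx2, by2 = regression_plane(second)
    return i1 - i2, bx1 - bx2, by1 - by2


def residual_sum_of_squares(g):
    """One half of S_a^2: residual sum of squares about the group's plane."""
    numerator = (1.0 - g.r_xy ** 2 - g.r_xt ** 2 - g.r_yt ** 2
                 + 2.0 * g.r_xy * g.r_xt * g.r_yt)
    return g.n * g.sd_t ** 2 * numerator / (1.0 - g.r_xy ** 2)


def absolute_minimum(first, second):
    """S_a^2 as the sum of the two groups' residual sums of squares."""
    return residual_sum_of_squares(first) + residual_sum_of_squares(second)


def line_of_non_significance_y(a, b, c, x):
    """Solve a + b x' + c y' = 0 for y' at a chosen x'."""
    if c == 0:
        raise ValueError("the line is vertical; solve for x' instead")
    return -(a + b * x) / c


# Illustrative statistics only.
good = GroupStats(10, 50.0, 50.0, 10.0, 10.0, 10.0, 2.0, 0.0, 0.5, 0.0)
poor = GroupStats(10, 50.0, 50.0, 8.0, 10.0, 10.0, 2.0, 0.0, 0.0, 0.5)
a, b, c = difference_coefficients(good, poor)
s_a2 = absolute_minimum(good, poor)
```

Two or three values of x′ passed to `line_of_non_significance_y` give enough points to rule the line on the graph.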

The Co-ordinates of the Center of Accuracy
are found by solving the simultaneous
equations: (J-N, p. 85, F. 106)

$$A'x_0 + B'y_0 + D' = 0$$
$$B'x_0 + C'y_0 + E' = 0$$

Koenker and Hansen have given the solution
in the form: (K-H, p. 169)

$$x_0 = \frac{B'E' - C'D'}{M}, \qquad y_0 = \frac{B'D' - A'E'}{M}$$

where:

$$M = A'C' - B'^2$$
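The closed-form solution can be checked numerically. The following Python sketch solves the pair of equations by the quotients over M and substitutes the result back into both equations; the function name and the coefficient values are hypothetical.

```python
def center_of_accuracy(A, B, C, D, E):
    """Solve A x0 + B y0 + D = 0 and B x0 + C y0 + E = 0 for (x0, y0)."""
    M = A * C - B * B
    if M == 0:
        raise ValueError("A*C - B^2 is zero; the equations have no unique solution")
    x0 = (B * E - C * D) / M
    y0 = (B * D - A * E) / M
    return x0, y0


# Illustrative coefficients only.
A, B, C, D, E = 2.0, 0.5, 3.0, -1.0, -2.0
x0, y0 = center_of_accuracy(A, B, C, D, E)

# Substituting back should return zero (up to rounding) in both equations.
residual_1 = A * x0 + B * y0 + D
residual_2 = B * x0 + C * y0 + E
```

The same substitution serves as a check in hand computation before the point is located on the graph.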


The third set of required statistics, $a_1$, $b_1$,
$c_1$, are given by the following formulas: (J-N,
p. 85, F. 109)

$$a_1 = D'c - E'b; \qquad b_1 = A'c - B'b; \qquad c_1 = B'c - C'b$$

When they are combined in this form:

$$a_1 + b_1 x' + c_1 y' = 0$$

they define the Diameter of the Region of
Significance on the graph (K-H, pp. 169-70).
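That the Diameter must pass through the Center of Accuracy follows directly from these formulas, and a short Python sketch can confirm it numerically. The function names and all values are hypothetical; the center is recomputed inside the block so that it runs on its own.

```python
def center_of_accuracy(A, B, C, D, E):
    """Solve A x0 + B y0 + D = 0 and B x0 + C y0 + E = 0."""
    M = A * C - B * B
    return (B * E - C * D) / M, (B * D - A * E) / M


def diameter_coefficients(A, B, C, D, E, b, c):
    """a1, b1, c1 of the Diameter a1 + b1 x' + c1 y' = 0."""
    a1 = D * c - E * b
    b1 = A * c - B * b
    c1 = B * c - C * b
    return a1, b1, c1


# Illustrative values only: quadratic coefficients and the slopes b, c
# of the Line of Non-Significance.
A, B, C, D, E = 2.0, 0.5, 3.0, -1.0, -2.0
b, c = 0.1, -0.1

x0, y0 = center_of_accuracy(A, B, C, D, E)
a1, b1, c1 = diameter_coefficients(A, B, C, D, E, b, c)

# The check recommended in the procedure: the center satisfies the equation.
check = a1 + b1 * x0 + c1 * y0
```

Algebraically, the expression equals c(A′x₀ + B′y₀ + D′) − b(B′x₀ + C′y₀ + E′), and both bracketed terms vanish at the center, so `check` is zero apart from rounding.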

The test of significance is made by comparing
$w_{obs}$ with $w_{.01}$ and $w_{.05}$. The observed
w is found as follows: (J-N, p. 85, F’s. 116,
111, 112, 113, 115).

we: AH 
Woes "(H+ & A) 





__ (A’C’ — B”) (b,c — bc,) 
. BF + 6 
H = D’x, + E’y, + H’ 
(b,c — bc,)? 
oe Er, 
bx, 0 
g~ stin tor 


To find $w_{.01}$ and $w_{.05}$ (the values of $w$ at which the hypothesis is to be accepted or rejected at the 1 percent level and the 5 percent level, respectively) Johnson and Neyman make use of Pearson's Tables of the Incomplete Beta Functions, and a Supplementary Table which they furnish (J-N, p. 84, p. 92). They also point out the possibility of interpreting the results of this particular technique by referring to Fisher's Tables of 5 percent and 1 percent limits of z, with degrees of freedom $n_1$ and $n_2$.* (J-N, p. 90.) This approach naturally suggests the use of Snedecor's Table for the Distribution of F,⁷ which is a little easier to apply, and probably more familiar to most readers. Koenker and Hansen have given the formulas for the two levels of significance of $w$ in this form: (K-H, p. 169)

$$w_{.01} = \frac{n - s}{F_{.01}}; \qquad w_{.05} = \frac{n - s}{F_{.05}}$$

where the F is found in Snedecor's Table, which is entered with the following degrees of freedom:

$$n_1 = 1; \qquad n_2 = n - s = N_1 + N_2 - 6$$
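As a hedged illustration of the step just described, the sketch below computes the degrees of freedom and the critical $w$ from a tabled F. It assumes the relation reads $w = (n - s)/F$; the function name `critical_w`, the group sizes, and the F value used are illustrative assumptions.

```python
def critical_w(size_group_1, size_group_2, f_critical):
    """Return (n - s, critical w) for a tabled F with n1 = 1 and n2 = n - s."""
    dof = size_group_1 + size_group_2 - 6   # n2 = n - s = N1 + N2 - 6
    return dof, dof / f_critical            # assumed relation: w = (n - s) / F


# Hypothetical group sizes; 3.98 is an approximate 5 percent F for (1, 70) d.f.
dof, w_05 = critical_w(40, 36, f_critical=3.98)
print(dof, w_05)
```

The tabled F must be supplied by the reader; the sketch does not look it up.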


The best estimate of the true difference between the two groups, with the two measures statistically controlled, is given by: (K-H, p. 172)

$$F = a + bx_0 + cy_0$$

[This F is not to be confused with Snedecor's F; the symbol was employed in the original article, in which Snedecor's table was not mentioned.]

The Variance of this best estimate of the difference is given by: (J-N, p. 68, F. 40)

$$V_F = \frac{S_a^2\,F^2}{(n - s)\,\dfrac{F^2}{(P+Q)_0}}$$

and the Standard Error of the difference is:

$$\sigma_F = \sqrt{V_F}$$

The Equation of the curve bounding the Region of Significance is given by: (J-N, p. 85, F's. 110, 114)

$$w k^2 \xi^2 - A(\xi - \xi_0)^2 - C\eta^2 - H = 0$$

where

$$C = \frac{b_1c - bc_1}{b^2 + c^2}$$

and the other terms have been defined previously.

It is to be noted that the variables in this formula are $\xi$ and $\eta$, and that the $w$ is that at the level at which significance has been found ($w_{.01}$ or $w_{.05}$), and not the observed $w$. The curve is located on the graph by using the Line of Non-Significance,

$$a + bx' + cy' = 0$$

as the axis for $\eta$, and the Diameter of the Region of Significance,

$$a_1 + b_1x' + c_1y' = 0$$

as the axis for $\xi$.

* R. A. Fisher. Statistical Methods for Research Workers, Table VI.
7 George W. Snedecor. Statistical Methods, pp. 184-87.


SIMPLIFYING THE COMPUTATION OF P + Q

The first suggestion for simplifying the computational procedures grows out of a consideration of the formula for P + Q. An examination of this formula as given above reveals that it requires only original statistics containing x and y. A common situation in which the Johnson-Neyman technique is applied is one in which several comparisons are to be made, holding constant the same two factors, and using the same two groups of subjects. In such a case it is an advantage to obtain the values given by the P + Q formula for the general data first; and later to change them to whatever form may be required for the particular comparisons.

The operations involved in finding P + Q may also be considerably simplified, from the standpoint of making it possible to find certain preliminary results and to hold them for further use, by a slight rearrangement of the formula. It will be noted that the formula for P + Q as given above consists of two parts; one based on data for the first group, and the other for the second group. Aside from the changes in the subscripts and primes, the two parts of the formula are precisely the same. We will deal here, therefore, with one-half of the formula only; and for purposes of simplifying the notation, eliminate the subscripts and primes. It is to be understood that the results which will now be given apply separately to the data for the two groups. To obtain the final results, it is only necessary to compute separately for the two groups, and add together the two partial results so found.

Re-writing the half of the formula, without subscripts and primes, and expanding the terms within the parentheses, we have:

$$\frac{1}{N(1-r_{xy}^2)}\left[\frac{x'^2}{\sigma_x^2} - \frac{2\bar{x}x'}{\sigma_x^2} + \frac{\bar{x}^2}{\sigma_x^2} - \frac{2r_{xy}\,x'y'}{\sigma_x\sigma_y} + \frac{2r_{xy}\,\bar{y}x'}{\sigma_x\sigma_y} + \frac{2r_{xy}\,\bar{x}y'}{\sigma_x\sigma_y} - \frac{2r_{xy}\,\bar{x}\bar{y}}{\sigma_x\sigma_y} + \frac{y'^2}{\sigma_y^2} - \frac{2\bar{y}y'}{\sigma_y^2} + \frac{\bar{y}^2}{\sigma_y^2}\right] + \frac{1}{N}$$

Collecting common terms, and removing the brackets, this becomes:

$$\left(\frac{1}{N(1-r_{xy}^2)\,\sigma_x^2}\right)x'^2 + 2\left(\frac{-r_{xy}}{N(1-r_{xy}^2)\,\sigma_x\sigma_y}\right)x'y' + \left(\frac{1}{N(1-r_{xy}^2)\,\sigma_y^2}\right)y'^2$$
$$+\; 2\left(\frac{r_{xy}\,\bar{y}}{N(1-r_{xy}^2)\,\sigma_x\sigma_y} - \frac{\bar{x}}{N(1-r_{xy}^2)\,\sigma_x^2}\right)x' + 2\left(\frac{r_{xy}\,\bar{x}}{N(1-r_{xy}^2)\,\sigma_x\sigma_y} - \frac{\bar{y}}{N(1-r_{xy}^2)\,\sigma_y^2}\right)y'$$
$$+\; \left(\frac{\bar{x}^2}{N(1-r_{xy}^2)\,\sigma_x^2} - \frac{2r_{xy}\,\bar{x}\bar{y}}{N(1-r_{xy}^2)\,\sigma_x\sigma_y} + \frac{\bar{y}^2}{N(1-r_{xy}^2)\,\sigma_y^2} + \frac{1}{N}\right)$$


Obviously, the terms within the various parentheses are the contributions of half of the data to the statistics A′, B′, C′, etc. However, since the results obtained by adding together the terms from the good and the poor groups would then be only the parts of the expressions coming from P + Q, suppose we call them A″, B″, C″, etc., to distinguish from the A′, B′, C′, etc., which belong to the data for one particular comparison. The half of A″ which comes from the first group will be designated $A_1$; that which comes from the second group, $A_2$, and so on for the other letters.

It is, therefore, only necessary to compute the separate items within the parentheses of this last form of the formula in order to find $A_1$, $B_1$, $C_1$, etc. But an examination of the various terms reveals that they have many common expressions. Thus, each will contain, in the denominator, the expression $N(1 - r_{xy}^2)$; several also contain $\sigma_x^2$; several, $\sigma_y^2$; and others, $\sigma_x\sigma_y$. Since these expressions recur, it will simplify the process of computation if they are found separately, and designated by some suitable symbols. Therefore, let:

$$\gamma = N(1 - r_{xy}^2)$$

$$\Phi_1 = \gamma\,\sigma_x\sigma_y$$

$$\Phi_2 = \gamma\,\sigma_x^2$$

$$\Phi_3 = \gamma\,\sigma_y^2$$

There is one expression, $r_{xy}$, which occurs several times in the numerator, but always with the same denominator. Therefore, let:

$$\lambda = \frac{r_{xy}}{\Phi_1}$$

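With these symbols it is now possible to express $A_1$, $B_1$, etc., very simply. The single primes indicate that these statistics are for the first group. $A_2$, $B_2$, $C_2$, etc., are found from the same type of statistics, differentiated by the double prime.

$$A_1 = \frac{1}{\Phi_2'}; \qquad B_1 = -\lambda'; \qquad C_1 = \frac{1}{\Phi_3'}$$

$$D_1 = \lambda'\bar{y}_1 - \frac{\bar{x}_1}{\Phi_2'}; \qquad E_1 = \lambda'\bar{x}_1 - \frac{\bar{y}_1}{\Phi_3'}$$

$$H_1 = \frac{\bar{x}_1^2}{\Phi_2'} + \frac{\bar{y}_1^2}{\Phi_3'} - 2\lambda'\bar{x}_1\bar{y}_1 + \frac{1}{N_1}$$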


The advantage of such a method of expression is that it is possible to compute $\gamma$, $\Phi_1$, $\Phi_2$, $\Phi_3$, and $\lambda$, writing down each result in a single term; and then to use these terms in the formulas for $A_1$, $B_1$, etc. The operations, which include primarily only multiplication and division, can be carried out either by means of logarithms, or by the use of a calculating machine. There will be little danger of confusion, since each operation is limited in itself, and the partial results are set down and given definite names.

After $A_1$, $A_2$; $B_1$, $B_2$; etc., have been found, the two similar results are simply added together to give A″, B″, etc. These latter terms will apply to all comparisons made on the same fundamental data, holding constant the same two factors. To change them to A′, B′, etc., for the particular comparisons, it is only necessary to multiply each one by $S_a^2$, as found for the data in each comparison.
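The procedure of this section may be sketched as follows. The sketch assumes that each half of P + Q has the usual form $1/N$ plus the standardized quadratic in $x'$ and $y'$ divided by $N(1 - r_{xy}^2)$; the function names, dictionary keys, and numbers are hypothetical and serve only to illustrate the bookkeeping.

```python
def half_coefficients(N, mean_x, mean_y, sd_x, sd_y, r_xy):
    """Contribution of one group to A'', B'', C'', D'', E'', H'' (assumed form)."""
    gamma = N * (1.0 - r_xy ** 2)
    phi1 = gamma * sd_x * sd_y
    phi2 = gamma * sd_x ** 2
    phi3 = gamma * sd_y ** 2
    lam = r_xy / phi1
    return {
        "A": 1.0 / phi2,
        "B": -lam,
        "C": 1.0 / phi3,
        "D": lam * mean_y - mean_x / phi2,
        "E": lam * mean_x - mean_y / phi3,
        "H": (mean_x ** 2 / phi2 + mean_y ** 2 / phi3
              - 2.0 * lam * mean_x * mean_y + 1.0 / N),
    }


def general_coefficients(half_1, half_2):
    """Add the two group halves to obtain the double-primed statistics."""
    return {key: half_1[key] + half_2[key] for key in half_1}


def quadratic_form(coef, x, y):
    """Evaluate A x^2 + 2B xy + C y^2 + 2D x + 2E y + H at (x, y)."""
    return (coef["A"] * x * x + 2.0 * coef["B"] * x * y + coef["C"] * y * y
            + 2.0 * coef["D"] * x + 2.0 * coef["E"] * y + coef["H"])


# Hypothetical summary statistics for the two groups.
first = half_coefficients(N=40, mean_x=20.0, mean_y=30.0, sd_x=4.0, sd_y=5.0, r_xy=0.4)
second = half_coefficients(N=36, mean_x=18.0, mean_y=28.0, sd_x=5.0, sd_y=6.0, r_xy=0.3)
general = general_coefficients(first, second)

# For a particular comparison, each term is multiplied by that comparison's S_a^2.
s_a_sq = 250.0  # hypothetical value
particular = {key: s_a_sq * value for key, value in general.items()}
```

Keeping the general (double-primed) values in one place means that only the final multiplication has to be repeated for each new comparison.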


THE CENTER OF ACCURACY

By reference to the formulas for $x_0$, $y_0$, the coordinates of the Center of Accuracy, it will be noted that they involve only expressions of the type just indicated.

$$x_0 = \frac{B'E' - C'D'}{M}, \qquad y_0 = \frac{B'D' - A'E'}{M}$$

where: $M = A'C' - B'^2$

This latter may be written:

$$M = (S_a^2 A'')(S_a^2 C'') - (S_a^2 B'')^2$$

$$M = (S_a^2)^2\,(A''C'' - B''^2) = (S_a^2)^2\,M''$$

The coordinates $x_0$, $y_0$ may also be written:

$$x_0 = \frac{(S_a^2 B'')(S_a^2 E'') - (S_a^2 C'')(S_a^2 D'')}{(S_a^2)^2\,M''} = \frac{B''E'' - C''D''}{M''}$$

and similarly for $y_0$. In other words, the Center of Accuracy may be found from the terms obtained for the general data. It will be the same for all comparisons holding constant the two factors involved, x and y.


THE VARIANCE AND THE STANDARD ERROR OF
THE BEST ESTIMATE OF THE DIFFERENCE
BETWEEN THE GROUPS

According to the formula given above, the Variance of the best estimate of the true difference between the two groups will be furnished by:

$$V_F = \frac{S_a^2\,F^2}{(n - s)\,\dfrac{F^2}{(P+Q)_0}}$$

The denominator expression $(P+Q)_0$ means simply the result obtained by substituting the values of $x_0$ and $y_0$ in the formula for P + Q. Let us call this $T''$ (the double prime indicating that it may be obtained from the general data). The formula is:

$$T'' = (P+Q)_0 = A''x_0^2 + 2B''x_0y_0 + C''y_0^2 + 2D''x_0 + 2E''y_0 + H''$$

The formula for $V_F$ may now be written:

$$V_F = \frac{S_a^2\,F^2}{(n - s)\,\dfrac{F^2}{T''}} = \frac{S_a^2\,T''}{n - s}$$

If we now let $T' = S_a^2\,T''$, we can write:

$$V_F = \frac{T'}{n - s}$$
which is considerably simpler than the form 
originally given. This form also makes it 
unnecessary to find S,*. The expression, 7’, 
given above, is analogous to several others, in that $T''$ is found for the general data, and when multiplied by $S_a^2$ it yields $T'$ for the particular comparison.
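A minimal sketch of this shortened computation follows. The function name `variance_of_estimate` and all numerical values are assumptions chosen for illustration; the coefficients stand for the double-primed (general) statistics.

```python
import math


def variance_of_estimate(general, x0, y0, s_a_sq, dof):
    """Return (T'', T', V_F, standard error) from the general-data coefficients."""
    t_general = (general["A"] * x0 ** 2 + 2.0 * general["B"] * x0 * y0
                 + general["C"] * y0 ** 2 + 2.0 * general["D"] * x0
                 + 2.0 * general["E"] * y0 + general["H"])      # T''
    t_particular = s_a_sq * t_general                           # T' = S_a^2 T''
    variance = t_particular / dof                               # V_F = T' / (n - s)
    return t_general, t_particular, variance, math.sqrt(variance)


# Hypothetical values, for illustration only.
general = {"A": 0.02, "B": -0.005, "C": 0.03, "D": -0.05, "E": -0.1, "H": 1.0}
t2, t1, v_f, se_f = variance_of_estimate(general, x0=2.0, y0=3.0, s_a_sq=250.0, dof=70)
print(t2, t1, v_f, se_f)
```

The value of F itself does not enter the shortened form, since it cancels from numerator and denominator.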


FINDING THE LINE OF NON-SIGNIFICANCE

Three other statistics which are required for the complete solution are a, b, c, which are contained in the equation of the Line of Non-Significance:

$$a + bx' + cy' = 0$$

As defined by the formula given in a previous section, this will easily be recognized as the difference between two regression equations. These regression equations are the ones which predict the score on the particular measure to be tested, in terms of the two measures held constant, for the first and the second groups. Using the conventional symbols for the regression coefficients, this formula may be written:

$$F(x'y') = (b_{1zx.y}\,x' + b_{1zy.x}\,y' + K_1) - (b_{2zx.y}\,x' + b_{2zy.x}\,y' + K_2)$$

where:

$$b_{zx.y} = \frac{\sigma_z}{\sigma_x}\cdot\frac{r_{zx} - r_{zy}\,r_{xy}}{1 - r_{xy}^2}$$

and similarly for the other regression coefficients, by rotation of the subscripts, and

$$K_1 = \bar{z}_1 - b_{1zx.y}\,\bar{x}_1 - b_{1zy.x}\,\bar{y}_1$$

and similarly for $K_2$. The three statistics required, a, b, and c, may then be expressed as follows:

$$a = K_1 - K_2$$

$$b = b_{1zx.y} - b_{2zx.y}$$

$$c = b_{1zy.x} - b_{2zy.x}$$


The equation of the Line of Non-Significance may be re-arranged to make it more convenient for graphing:

$$a + bx' + cy' = 0$$

$$cy' = -a - bx'$$

or:

$$y' = -\frac{a + bx'}{c}$$

In this form values of $x'$ may be substituted directly, and the corresponding values of $y'$ discovered. The line may then be represented on the graph.
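The computation of a, b, and c, and of points on the line, may be sketched as below. The sketch assumes the standard two-predictor regression formulas, with z standing for the measure to be tested; the function names, dictionary keys, and summary statistics are hypothetical.

```python
def regression_plane(g):
    """Return (b_zx.y, b_zy.x, K) for one group's summary statistics."""
    denom = 1.0 - g["r_xy"] ** 2
    b_x = (g["sd_z"] / g["sd_x"]) * (g["r_zx"] - g["r_zy"] * g["r_xy"]) / denom
    b_y = (g["sd_z"] / g["sd_y"]) * (g["r_zy"] - g["r_zx"] * g["r_xy"]) / denom
    K = g["mean_z"] - b_x * g["mean_x"] - b_y * g["mean_y"]
    return b_x, b_y, K


def line_of_non_significance(g1, g2):
    """Return (a, b, c) as the difference of the two regression planes."""
    bx1, by1, K1 = regression_plane(g1)
    bx2, by2, K2 = regression_plane(g2)
    return K1 - K2, bx1 - bx2, by1 - by2


def y_on_line(a, b, c, x):
    """Solve a + b*x + c*y = 0 for y (requires c != 0)."""
    return -(a + b * x) / c


# Hypothetical summary statistics for the two groups.
first = {"mean_z": 50.0, "mean_x": 20.0, "mean_y": 30.0, "sd_z": 10.0,
         "sd_x": 4.0, "sd_y": 5.0, "r_zx": 0.6, "r_zy": 0.5, "r_xy": 0.4}
second = {"mean_z": 45.0, "mean_x": 18.0, "mean_y": 28.0, "sd_z": 9.0,
          "sd_x": 5.0, "sd_y": 6.0, "r_zx": 0.5, "r_zy": 0.3, "r_xy": 0.3}
a, b, c = line_of_non_significance(first, second)
points = [(x, y_on_line(a, b, c, x)) for x in (10.0, 20.0, 30.0)]
```

Two points suffice to draw the line; a third serves as a check on the plotting.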


THE FORMULA FOR $w_{obs}$

The formula for $w_{obs}$ was given above in the form:

$$w_{obs} = \frac{AH}{k^2\,(A\xi_0^2 + H)}$$

where:

$$A = \frac{(A'C' - B'^2)(b_1c - bc_1)}{b_1^2 + c_1^2}$$

$$H = D'x_0 + E'y_0 + H'$$

$$k^2 = \frac{(b_1c - bc_1)^2}{b_1^2 + c_1^2}$$

$$\xi_0 = \frac{a + bx_0 + cy_0}{k}$$

Since the aim of the present analysis is to reduce to simple forms for calculation, and to name and set down separately for future use any terms which are repeated, it is well to examine these formulas. In the formula for A, we note first the expression $(A'C' - B'^2)$, which we recognize as M, defined above. Since we have found $M''$ as a step in determining the coordinates of the Center of Accuracy, from the general data, we can just as well write: $M = (S_a^2)^2 M''$ and use this M in the formula for A.

We note next that the term $b_1c - bc_1$ occurs twice in this group of formulas. It also is found in the formula for C:

$$C = \frac{b_1c - bc_1}{b^2 + c^2}$$

which is needed later. Therefore, we define it as follows:

$$\beta = b_1c - bc_1$$

The term $b_1^2 + c_1^2$ also occurs twice, so we let:

$$\alpha = b_1^2 + c_1^2$$

The formula for A then becomes:

$$A = \frac{M\beta}{\alpha}$$

We note also that through the use of the symbols just defined, we can re-write the formula for $k^2$:

$$k^2 = \frac{\beta^2}{\alpha}$$

As the numerator term of the formula for $\xi_0$, we find the expression $a + bx_0 + cy_0$. But this is also the expression which has been defined above as F, the best estimate of the difference between the good and the poor groups. So we write:

$$F = a + bx_0 + cy_0$$

$$\xi_0 = \frac{F}{k}$$

In the denominator of the formula for $w_{obs}$ we discover the form $A\xi_0^2 + H$. It will be shown later that this term also occurs in the analysis of the equation for the Region of Significance, so we might as well name it now:

$$\rho = A\xi_0^2 + H$$

We have, therefore, as our formula for $w_{obs}$:

$$w_{obs} = \frac{AH}{k^2\rho}$$
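The chain of named intermediate results can be sketched in Python as follows. The function name `observed_w`, the argument names (with the suffix `p` standing for a single prime), and the numerical values are assumptions for illustration only.

```python
def observed_w(Ap, Bp, Cp, Dp, Ep, Hp, a, b, c):
    """Compute w_obs from the primed statistics and the line coefficients."""
    M = Ap * Cp - Bp ** 2
    x0 = (Bp * Ep - Cp * Dp) / M          # Center of Accuracy
    y0 = (Bp * Dp - Ap * Ep) / M
    b1 = Ap * c - Bp * b                  # Diameter coefficients
    c1 = Bp * c - Cp * b
    beta = b1 * c - b * c1
    alpha = b1 ** 2 + c1 ** 2
    A = M * beta / alpha
    H = Dp * x0 + Ep * y0 + Hp
    k_sq = beta ** 2 / alpha
    F = a + b * x0 + c * y0               # best estimate of the difference
    rho = A * F ** 2 / k_sq + H           # rho = A * xi_0^2 + H, with xi_0 = F / k
    return A * H / (k_sq * rho)


# Hypothetical values, not taken from the article.
w_obs = observed_w(Ap=2.0, Bp=0.5, Cp=1.5, Dp=-3.0, Ep=-1.0, Hp=6.0,
                   a=1.0, b=0.8, c=-0.6)
print(w_obs)
```

Only the square of $\xi_0$ is needed here, so the sign convention for k does not affect the result.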


THE EQUATION OF THE REGION
OF SIGNIFICANCE

The equation of the Region of Significance is given in the form:

$$w k^2 \xi^2 - A(\xi - \xi_0)^2 - C\eta^2 - H = 0$$

Koenker and Hansen in their article substituted the values for each of the items involved in the equation, to six decimal places, and then proceeded to rearrange the equation to arrive at a convenient form for graphing. This procedure involved a great deal of arithmetic, multiplying and dividing with rather cumbersome numbers. It appears that at this point, again, a little algebraic manipulation might save a large amount of arithmetic.

Expanding the equation, and collecting like terms, we have:

$$(w k^2 - A)\,\xi^2 + 2A\xi_0\,\xi - C\eta^2 - (A\xi_0^2 + H) = 0$$

The coefficient of $\xi^2$, $(w k^2 - A)$, will determine the shape of the curve. When this expression is positive, the region of significance will be a hyperbola; when it is negative, the region will be defined by an ellipse; when it is zero, the curve will be a parabola.
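The three cases can be distinguished mechanically, as in the short sketch below. The function name `curve_type` and the tolerance argument are illustrative assumptions; with hand or slide-rule arithmetic an exact zero is unlikely, so a small tolerance is allowed.

```python
def curve_type(w, k_sq, A, tol=1e-12):
    """Classify the bounding curve from the sign of w*k^2 - A."""
    sigma = w * k_sq - A
    if abs(sigma) <= tol:
        return "parabola"
    return "hyperbola" if sigma > 0 else "ellipse"


print(curve_type(w=2.0, k_sq=1.0, A=1.5))
```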


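THE HYPERBOLA

Let us take first the hyperbola form, and let

$$\sigma = w k^2 - A$$

We also note in the formula above, the expression $A\xi_0^2 + H$, which we have already defined as $\rho$. Let us introduce one other symbol:

$$\mu = A\xi_0$$

Substituting these symbols, and keeping only the terms in $\xi^2$ and $\xi$ in the left-hand member, we have:

$$\sigma\xi^2 + 2\mu\xi = C\eta^2 + \rho$$

Dividing through by $\sigma$:

$$\xi^2 + 2\frac{\mu}{\sigma}\xi = \frac{C}{\sigma}\eta^2 + \frac{\rho}{\sigma}$$

Completing the square in the left-hand member:

$$\xi^2 + 2\frac{\mu}{\sigma}\xi + \left(\frac{\mu}{\sigma}\right)^2 = \frac{C}{\sigma}\eta^2 + \frac{\rho}{\sigma} + \left(\frac{\mu}{\sigma}\right)^2$$

Rewriting, and combining the constant terms:

$$\left(\xi + \frac{\mu}{\sigma}\right)^2 = \frac{C}{\sigma}\eta^2 + \frac{\mu^2 + \rho\sigma}{\sigma^2}$$

Now, for the sake of simplicity, let:

$$\epsilon = \frac{\mu^2 + \rho\sigma}{\sigma^2}$$

Substituting, and leaving only the constant term in the right-hand member:

$$\left(\xi + \frac{\mu}{\sigma}\right)^2 - \frac{C}{\sigma}\eta^2 = \epsilon$$

Dividing through by $\epsilon$:

$$\frac{\left(\xi + \frac{\mu}{\sigma}\right)^2}{\epsilon} - \frac{C}{\sigma\epsilon}\eta^2 = 1$$

Keeping only the $\eta^2$ term in the left-hand member:

$$\frac{C}{\sigma\epsilon}\eta^2 = \frac{\left(\xi + \frac{\mu}{\sigma}\right)^2}{\epsilon} - 1$$

Multiplying both sides by $\dfrac{\sigma\epsilon}{C}$:

$$\eta^2 = \frac{\sigma\epsilon}{C}\left[\frac{\left(\xi + \frac{\mu}{\sigma}\right)^2}{\epsilon} - 1\right]$$

Taking the square root of both sides:

$$\eta = \pm\sqrt{\frac{\sigma\epsilon}{C}}\;\sqrt{\frac{\left(\xi + \frac{\mu}{\sigma}\right)^2}{\epsilon} - 1}$$

which is the required equation of the hyperbola, suitable for substituting values of $\xi$ to find the corresponding values of $\eta$ and to plot the curve.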

It should be observed in locating the curve on the graph that the Line of Non-Significance,

$$y' = -\frac{a}{c} - \frac{b}{c}x'$$

is taken as the axis for $\eta$, and the Diameter of the Region of Significance,

$$y' = -\frac{a_1}{c_1} - \frac{b_1}{c_1}x'$$

as the axis for $\xi$. The positive direction for $\xi$ is taken away from the line $O\eta$, toward the Center of Accuracy.

It is also to be noted that the zero points for $\eta$ (the edge of the curve where it approaches most nearly to $O\eta$) will be such that

$$\frac{\left(\xi + \frac{\mu}{\sigma}\right)^2}{\epsilon} - 1 = 0$$

in other words, the points at which the quantity under the radical is zero. These are, obviously, at the points where:

$$\xi = \pm\sqrt{\epsilon} - \frac{\mu}{\sigma}$$

In other words, there will be two branches to the hyperbola. One will approach the line $O\eta$ at the point where

$$\xi = +\sqrt{\epsilon} - \frac{\mu}{\sigma}$$

Since $\epsilon = \dfrac{\mu^2 + \rho\sigma}{\sigma^2}$, and since $\sigma$, $\rho$, and $\mu$ are always positive, $\sqrt{\epsilon}$ will always be larger than $\dfrac{\mu}{\sigma}$. Therefore, this area is the one favorable to the first group, that is, the area within which the first group excels the second group on the measure compared. The other branch, approaching the line $O\eta$ at the point where

$$\xi = -\sqrt{\epsilon} - \frac{\mu}{\sigma}$$

will be the area favorable to the second group, that is, the area within which the second group excels the first group on the measure compared.
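For plotting, the hyperbola form may be sketched as below. The function names and numerical values are hypothetical; the sketch assumes $\sigma > 0$ and $C > 0$ and returns the positive root only, the curve being symmetrical about the $\xi$ axis.

```python
import math


def hyperbola_vertices(sigma, mu, rho):
    """Return the two values of xi at which eta is zero."""
    eps = (mu ** 2 + rho * sigma) / sigma ** 2
    return -math.sqrt(eps) - mu / sigma, math.sqrt(eps) - mu / sigma


def hyperbola_eta(xi, sigma, mu, rho, C):
    """Positive eta for a given xi, or None where the curve does not exist."""
    eps = (mu ** 2 + rho * sigma) / sigma ** 2
    radicand = (xi + mu / sigma) ** 2 / eps - 1.0
    if radicand < 0:
        return None                     # xi lies between the two branches
    return math.sqrt(sigma * eps / C) * math.sqrt(radicand)


# Hypothetical values, for illustration only.
sigma, mu, rho, C = 0.5, 1.2, 3.0, 0.8
left, right = hyperbola_vertices(sigma, mu, rho)
eta_at_2 = hyperbola_eta(2.0, sigma, mu, rho, C)
```

Values of $\xi$ are then taken outward from each zero point, and the resulting pairs $(\xi, \pm\eta)$ plotted on the oblique axes.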
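THE ELLIPSE

Let us now consider the case in which

$$w k^2 - A < 0$$

which gives the ellipse. Let

$$\nu = -\sigma = -(w k^2 - A)$$

Substituting in the original equation, we have:

$$\nu\xi^2 - 2\mu\xi = -C\eta^2 - \rho$$

Dividing through by $\nu$:

$$\xi^2 - 2\frac{\mu}{\nu}\xi = -\frac{C}{\nu}\eta^2 - \frac{\rho}{\nu}$$

Completing the square of the left-hand member:

$$\xi^2 - 2\frac{\mu}{\nu}\xi + \left(\frac{\mu}{\nu}\right)^2 = -\frac{C}{\nu}\eta^2 - \frac{\rho}{\nu} + \left(\frac{\mu}{\nu}\right)^2$$

Rewriting and combining constant terms:

$$\left(\xi - \frac{\mu}{\nu}\right)^2 = -\frac{C}{\nu}\eta^2 + \frac{\mu^2 - \rho\nu}{\nu^2}$$

Now, let:

$$\epsilon' = \frac{\mu^2 - \rho\nu}{\nu^2}$$

Substituting, and leaving the constant term in the right-hand member:

$$\left(\xi - \frac{\mu}{\nu}\right)^2 + \frac{C}{\nu}\eta^2 = \epsilon'$$

Dividing through by $\epsilon'$:

$$\frac{\left(\xi - \frac{\mu}{\nu}\right)^2}{\epsilon'} + \frac{C}{\nu\epsilon'}\eta^2 = 1$$

Keeping only the $\eta^2$ term in the left-hand member:

$$\frac{C}{\nu\epsilon'}\eta^2 = 1 - \frac{\left(\xi - \frac{\mu}{\nu}\right)^2}{\epsilon'}$$

Multiplying both sides by $\dfrac{\nu\epsilon'}{C}$ and taking the square root of both sides:

$$\eta = \pm\sqrt{\frac{\nu\epsilon'}{C}}\;\sqrt{1 - \frac{\left(\xi - \frac{\mu}{\nu}\right)^2}{\epsilon'}}$$

which is the required equation of the ellipse. This curve is located on the graph in the same manner as the hyperbola, using $O\eta$ and $O\xi$ as the axes, and counting the direction of $\xi$ in the same way.

The zero points for $\eta$ will be such that

$$1 - \frac{\left(\xi - \frac{\mu}{\nu}\right)^2}{\epsilon'} = 0$$

in other words, the points at which the quantity under the radical is zero. These are obviously the points where

$$\xi = \pm\sqrt{\epsilon'} + \frac{\mu}{\nu}$$

In other words, the ellipse will lie between the points

$$\xi = -\sqrt{\epsilon'} + \frac{\mu}{\nu} \qquad \text{and} \qquad \xi = +\sqrt{\epsilon'} + \frac{\mu}{\nu}$$

Since $\epsilon' = \dfrac{\mu^2 - \rho\nu}{\nu^2}$, and since both $\nu$ and $\rho$ are always positive, obviously $\dfrac{\mu}{\nu}$ will always be larger than $\sqrt{\epsilon'}$. Therefore the ellipse will lie wholly on the positive side of the line $O\eta$, and the entire ellipse will indicate an area favorable to the first group. The broadest part of the ellipse will lie half way between these two points, or where

$$\xi = \frac{\mu}{\nu}$$

Since for that value of $\xi$, the quantity under the radical becomes $(1 - 0)$, the diameter at this widest point will extend from

$$\eta = -\sqrt{\frac{\nu\epsilon'}{C}} \qquad \text{to} \qquad \eta = +\sqrt{\frac{\nu\epsilon'}{C}}$$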
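The ellipse form may be sketched in the same way. The function name `ellipse_eta` and the numerical values are hypothetical; the sketch assumes $\nu > 0$ and $C > 0$, and returns `None` where no real point of the curve exists.

```python
import math


def ellipse_eta(xi, nu, mu, rho, C):
    """Positive eta on the ellipse for a given xi, or None outside it."""
    eps_prime = (mu ** 2 - rho * nu) / nu ** 2
    if eps_prime <= 0:
        return None                     # no real ellipse at this level of w
    radicand = 1.0 - (xi - mu / nu) ** 2 / eps_prime
    if radicand < 0:
        return None                     # xi lies beyond the zero points
    return math.sqrt(nu * eps_prime / C) * math.sqrt(radicand)


# Hypothetical values, for illustration only.
nu, mu, rho, C = 0.5, 2.0, 3.0, 0.8
half_width = ellipse_eta(mu / nu, nu, mu, rho, C)   # widest point, xi = mu / nu
eta_at_5 = ellipse_eta(5.0, nu, mu, rho, C)
```

A negative $\epsilon'$ indicates that no region of significance exists at the level of $w$ used.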


THE PARABOLA 


There is one other possible (though not very probable) form of the curve, namely that which is found when

$$\sigma = w k^2 - A = 0$$

In such a case the curve becomes a parabola. The equation is very much simpler, since the $\xi^2$ term drops out. This leaves:

$$2\mu\xi - C\eta^2 - \rho = 0$$

Dividing through by $2\mu$, we have:

$$\xi - \frac{C}{2\mu}\eta^2 - \frac{\rho}{2\mu} = 0$$

Rearranging, to leave only the $\eta^2$ term in the left-hand member:

$$\frac{C}{2\mu}\eta^2 = \xi - \frac{\rho}{2\mu}$$

Multiplying both sides by $\dfrac{2\mu}{C}$:

$$\eta^2 = \frac{2\mu}{C}\left(\xi - \frac{\rho}{2\mu}\right)$$

Taking the square root of both sides:

$$\eta = \pm\sqrt{\frac{2\mu}{C}}\;\sqrt{\xi - \frac{\rho}{2\mu}}$$

which is the required equation, in convenient form for graphing. This curve will be nearest the $O\eta$ axis when the quantity under the radical is zero, or when

$$\xi - \frac{\rho}{2\mu} = 0$$

Therefore this point (the zero point for $\eta$) is the point where:

$$\xi = \frac{\rho}{2\mu}$$
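A corresponding sketch for the parabola is given below; the function name `parabola_eta` and the numbers are hypothetical, and $\mu$ and C are assumed positive.

```python
import math


def parabola_eta(xi, mu, rho, C):
    """Positive eta on the parabola for a given xi, or None before the vertex."""
    radicand = xi - rho / (2.0 * mu)
    if radicand < 0:
        return None
    return math.sqrt(2.0 * mu / C) * math.sqrt(radicand)


# Hypothetical values, for illustration only.
mu, rho, C = 1.2, 3.0, 0.8
vertex_xi = rho / (2.0 * mu)
eta_at_2 = parabola_eta(2.0, mu, rho, C)
```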





NOTE ON A TECHNIQUE IN THE APPLICATION 
OF THE TOLLEY-EZEKIEL METHOD OF HANDLING 
MULTIPLE-CORRELATION PROBLEMS 
CORNELIUS H. SIEMENS


University of California 
Berkeley, California 


A study was made recently to determine how well academic success in upper division engineering courses (normally junior and senior courses) could be predicted by using six known factors. The factors in the achievement records of engineering students at the University of California that proved to be best were the grade-point averages (G. P. A.) in high school physics and mathematics combined, college physics, college mathematics, college chemistry, all lower division courses, and the first semester of upper division engineering [...]. To [...] the Tolley-Ezekiel method,¹ a variation of the usual partial and multiple correlation technique, was applied. Furthermore, it was the aim of the study to reduce the time for calculations to a minimum commensurate with acceptable results for practical purposes.

The Tolley-Ezekiel method treats each [...]

$$\bar{X}_0 = b_{01.23\ldots n}\,X_1 + b_{02.13\ldots n}\,X_2 + \cdots + b_{0n.12\ldots(n-1)}\,X_n + C. \qquad (1)$$

The predicted criterion is $\bar{X}_0$ and $X_1, X_2, \ldots, X_n$ are the known variables; the b's and C are the constants to be determined.

By applying the principle of least squares, the normal equations to be solved simultaneously for the b's were found to be:

$$P_{01} = \sigma_1^2\,b_{01.23\ldots n} + P_{12}\,b_{02.13\ldots n} + \cdots + P_{1n}\,b_{0n.12\ldots(n-1)}$$
$$P_{02} = P_{12}\,b_{01.23\ldots n} + \sigma_2^2\,b_{02.13\ldots n} + \cdots + P_{2n}\,b_{0n.12\ldots(n-1)}$$
$$\cdots$$
$$P_{0n} = P_{1n}\,b_{01.23\ldots n} + P_{2n}\,b_{02.13\ldots n} + \cdots + \sigma_n^2\,b_{0n.12\ldots(n-1)} \qquad (2)$$

in which $P_{ij} = \dfrac{\Sigma x_i x_j}{N}$ $(i \neq j)$.

The normal equations above can be handled more easily if the following two substitutions are made:

$$P_{ij} = r_{ij}\,\sigma_i\sigma_j \qquad \left(\text{since } r_{ij} = \frac{\Sigma x_i x_j}{N\sigma_i\sigma_j};\; i \neq j\right) \qquad (3)$$

and by letting

$$m_i = \frac{\sigma_i}{\sigma_0}\,b_{0i.12\ldots n} \qquad (4)$$

As a result the equations (2) become

$$r_{01} = m_1 + r_{12}\,m_2 + \cdots + r_{1n}\,m_n$$
$$r_{02} = r_{12}\,m_1 + m_2 + \cdots + r_{2n}\,m_n$$
$$\cdots$$
$$r_{0n} = r_{1n}\,m_1 + r_{2n}\,m_2 + \cdots + m_n$$

in which $m_i$ can be solved by applying the method of determinants and the b's, in turn, can be calculated from equation (4).¹ Furthermore, the multiple correlation coefficient in terms of m's has the equation:

$$R^2_{0(123\ldots n)} = m_1 r_{01} + m_2 r_{02} + \cdots + m_n r_{0n}.$$

¹ Tolley, H. R. and Ezekiel, M. J. B. "A Method of Handling Multiple Correlation Problems." Journal of the American Statistical Association. XVIII (December, 1922), 993-1003.

¹ For a similar modification refer to Garrett, H. E. "A Modification of Tolley and Ezekiel's Method of Handling Multiple Correlation Problems." Journal of Educational Psychology. XIX (January, 1928), 45-49.

By using six factors, the calculations for $m_i$ involve sixth-order determinants, made up of zero-order Pearsonian correlation coefficients. To bring the work of calculation within feasible time limits the raw data (largely
grade-point averages) were put into standard 
score form, thereby facilitating the calcula- 
tion of the many intercorrelations. Further- 
more, slide-rule computations were found to 
be sufficiently accurate in working out the 
determinants. The latter “short-cut” proved 
to be an immense time-saver, while at the 
same time producing results comparable to 
other prediction studies. It was found that 
“slide-rule” accuracy of two (or three) sig- 
nificant figures provided results commensu- 
rate with the demands of school guidance; 
the arithmetical work, furthermore, was 
greatly reduced and facilitated. Also, the 
method of obtaining the solution by using 
determinants suggests that the “serious 
errors” of Shuttleworth¹ that may occur in
the ordinary method of calculating the partial 
correlations could well be reduced. 

The resultant six-factor prediction equa- 
tions were applied to two hundred unselected 


to ascertain how well they would operate 
in practice. It was found that the correlation coefficient between predicted scores and actual scores was r = .883 ± .01, whereas the multiple correlation coefficient was R = .8509, with a "chance" R of .18. These two results are reasonably close when one considers the number and character of calculations involved in the slide-rule computation of sixth-order determinants and the additional fact that the scatter-diagram showed a slight curvilinear relationship between the predicted and actual scores in the low-value region. As compared with the P. E. of Estimate of .22, the P. E. of the distribution of differences between predicted and actual scores was calculated to be .20. Thus a predicted grade-point average of 1.50 carries with it a fifty-fifty chance that the actual score will lie in the range 1.30 to 1.70.

¹ Shuttleworth, Frank K. "A Note on the Arithmetical Accuracy of Partials Involved in Multiple R." Journal of Educational Psychology. XXI (May, 1930), 379-80.
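The interval quoted above is simply the predicted score plus and minus one probable error, as the following sketch shows. The function names are hypothetical; the constant 0.6745 is the usual factor relating the probable error to the standard deviation of a normal distribution.

```python
def probable_error(standard_deviation):
    """Probable error of a normally distributed quantity (0.6745 sigma)."""
    return 0.6745 * standard_deviation


def fifty_fifty_range(predicted, prob_error):
    """Range expected to contain the actual score half of the time."""
    return predicted - prob_error, predicted + prob_error


low, high = fifty_fifty_range(1.50, 0.20)
print(round(low, 2), round(high, 2))
```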

The above results and comparisons sub- 
stantiate the following conclusions: 

1. The method and degree of accuracy em- 
ployed in this study have arrived at results 
which are useful and compare favorably in 
reliability with other similar prediction 
studies. 

2. The prediction equations of six variables 
as determined by using only slide-rule accu- 
racy operate in practice as well as their theo- 
retical statistical characteristics indicate that 
they should. 

3. In an application of the modified
Tolley-Ezekiel technique, forecasting upper
division engineering scholarship from a com- 
bination of factors taken from a student’s 
academic record was found to be possible and 
feasible. The predicted scores are sufficiently 
good to warrant their use in the guidance of 
students and possibly in the administration 
of admissions and dismissals. 





